Senior Data Engineer Resume

New York City, NY

SUMMARY

  • Over 8 years of professional experience in IT, including work with Big Data and Hadoop ecosystem-related technologies.
  • 4+ years of experience with the AWS cloud platform.
  • Good experience with the NoSQL databases HBase and MongoDB, and a good understanding of Cassandra.
  • Experience with the Snowflake cloud data warehouse and AWS S3 for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables (a minimal sketch follows this summary).
  • Good knowledge of streaming applications using Apache Kafka.
  • Expertise in designing, implementing, and optimizing unified logging systems and message brokers such as Apache Kafka and Amazon Kinesis Data Streams.
  • Extensively used the out-of-the-box Kafka processors available in NiFi, built against the Kafka consumer API, to consume data from Apache Kafka.
  • Keen interest in keeping up with the newer technology stack that Google Cloud Platform (GCP) offers.
  • Hands-on experience in performance tuning of transformations.
  • Good communication skills, a strong work ethic, and the ability to work efficiently in a team, along with good leadership skills.
  • Experience in managing Hadoop clusters using the Cloudera Manager tool.
  • Hands-on experience with data acquisition into Hadoop clusters using Sqoop and Flume.
  • Experience in designing both time-driven and data-driven automated workflows and ETL pipelines using cloud technologies such as AWS, Azure, and Databricks.
  • Experience in analyzing data using Spark SQL, HiveQL, Pig Latin, PySpark/Scala, and custom MapReduce programs in Java.
  • Extended Hive and Pig core functionality with custom UDFs.
  • Worked on NiFi data pipelines to process large data sets and configured lookups for data validation and integrity.
  • Worked extensively on building NiFi data pipelines in a Docker container environment during the development phase.
  • Worked with the DevOps team to cluster NiFi pipelines on EC2 nodes, integrated over SSL with Spark, Kafka, and Postgres running on other instances, in QA and production environments.
  • Consumed data from client APIs of servers hosted on AWS, then ingested, transformed, and loaded it into the Enterprise Data Hub using Apache NiFi processors such as InvokeAWSGatewayApi.
  • Good experience in the complete project life cycle (design, development, testing, and implementation) of client-server and web applications, ETL/ELT pipelines, and cloud migrations from on-premises systems.
  • Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modelling and data mining, machine learning and advanced data processing. Experience optimizing ETL workflows.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, EMR, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, AWS Glue, and other services in the AWS family.
  • Experience in big data on AWS Redshift with technologies such as AWS Athena and QuickSight.
  • Selecting appropriate AWS services to design and deploy an application based on given requirements.
  • Experience with cloud databases and data warehouses (SQL Azure and Amazon Redshift/RDS).
  • Expert in developing SSIS/DTS packages to extract, transform, and load (ETL) data into data warehouses/data marts from heterogeneous sources. Experience in developing ETL applications on large volumes of data using tools such as MapReduce, Spark-Scala, PySpark, Spark SQL, and Pig.
  • Well versed in installing, configuring, supporting, and managing Big Data workloads and the underlying infrastructure of Hadoop clusters.
  • Strong experience with big data processing using Hadoop technologies: MapReduce, Apache Spark, Apache Crunch, Hive, Pig, and YARN.
  • Hands-on experience in application development with Java, RDBMS, and UNIX shell scripting.
  • Experience in web services using XML, HTML, Ajax, jQuery, and JSON.
  • Hands-on experience in J2SE, J2EE, JSP, Servlets, EJB, WebLogic, WebSphere, Tomcat, JDBC, Python, and JavaScript.
  • Supported implementation of social media tools to drive new reporting solutions and improve our ability to operate efficiently.
  • Hands-on experience working with file formats such as JSON, CSV, Avro, Delta, and Parquet, using Databricks and Azure Data Factory to transform data and implement Data Lake/Delta Lake architectures for better reusability, scalability, and fully automated pipelines.
  • Detail-oriented and results-driven professional with experience in populating and maintaining an enterprise data warehouse and subject-area-specific data marts using the IBM DataStage and Pentaho ETL tools.
  • Worked with structured and unstructured data feeds, including big data, for financial, insurance, media, and retail applications, with an emphasis on data analysis, business logic, and data quality.
  • Experience in communicating with vendors to procure data and to share data via their SFTP servers for daily/weekly/biweekly/monthly jobs.
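A minimal sketch of the nested-JSON-to-Snowflake load mentioned above, using the snowflake-connector-python client; the connection details, bucket path, stage, and column names are placeholder assumptions, not values from an actual project:

    # Stage and copy nested JSON from S3 into a Snowflake table with a VARIANT
    # column, then flatten one level of nesting for downstream core tables.
    import snowflake.connector  # pip install snowflake-connector-python

    conn = snowflake.connector.connect(
        account="my_account",      # placeholder connection details
        user="etl_user",
        password="***",
        warehouse="ETL_WH",
        database="RAW",
        schema="LANDING",
    )
    cur = conn.cursor()

    # External stage pointing at the S3 bucket (bucket name is illustrative).
    cur.execute("""
        CREATE STAGE IF NOT EXISTS raw_json_stage
        URL = 's3://example-bucket/events/'
        CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
        FILE_FORMAT = (TYPE = JSON)
    """)

    # Land the nested JSON as-is into a VARIANT column.
    cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")
    cur.execute("COPY INTO raw_events FROM @raw_json_stage")

    # Flatten a nested array with LATERAL FLATTEN when building core tables.
    cur.execute("""
        SELECT payload:id::STRING, f.value:name::STRING
        FROM raw_events, LATERAL FLATTEN(input => payload:items) f
    """)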

TECHNICAL SKILLS

Big Data Ecosystems: Hadoop, MapReduce, HDFS, HBase, Zookeeper, Hive, Pig, Sqoop, Cassandra, Oozie, Kubernetes.

Spark Streaming Technologies: Spark, Kafka, Storm, NiFi, Kinesis

AWS tools: EC2, S3, AMI, RDS, Redshift, Glue, Kinesis, QuickSight, SageMaker

Programming Languages: Java, SQL, PySpark, Python, HTML5, Ruby

Databases: Data warehouses, DynamoDB, NoSQL, Oracle, Postgres, MySQL, Microsoft SQL Server

Tools: Eclipse, Jupyter, Google Cloud Shell, MS Visio, Microsoft Azure HDInsight, Airflow

Automation Testing Tools: Cucumber, Selenium

Methodologies: Agile/Scrum (Jira, Rally, Octane)

Operating Systems: Unix/Linux

Machine Learning Skills: Feature Extraction, Dimensionality Reduction, Model Evaluation, K-means, Regressions.

PROFESSIONAL EXPERIENCE

Confidential, New York City, NY

Senior Data Engineer

Responsibilities:

  • Designed and developed ETL integration patterns using PySpark and Python in Databricks.
  • Developed a framework for converting existing SharePoint-sourced Excel spreadsheets and mappings to PySpark (Python and Spark) jobs.
  • Created a PySpark framework to bring data from various sources, such as SFTP and DBMS systems, into Amazon S3.
  • Optimized the PySpark jobs to run on a Kubernetes cluster for faster data processing.
  • Built the infrastructure required for optimal extraction, transformation, and loading (ETL) of data from a wide variety of sources such as Salesforce, SQL Server, Oracle, and SAP using Azure, Spark, Python, Hive, Kafka, and other big data technologies.
  • Performed data QA/QE for data transfers into the data lake and data warehouse.
  • Built analytics tools that utilize the data pipeline to provide actionable insights into HR analytics, operational efficiency, and other key business performance metrics.
  • Migrated on-prem ETLs (MS SQL Server and Pentaho jobs) to the AWS cloud using Databricks.
  • Enhanced and optimized data pipelines with reusable frameworks to support the data needs of the HR analytics and payroll teams using Spark and Kinesis streaming services.
  • Responsible for architecting a complex data layer that sources raw data from a variety of sources, generates derived data per business requirements, and feeds it to BI reporting and data science teams.
  • Worked on reading and writing multiple data formats such as JSON, ORC, Parquet, and Delta in Databricks using PySpark (see the sketch after this list).
  • Involved in converting Hive queries into Spark actions and transformations by creating RDDs and DataFrames from the required files in HDFS.
  • Responsible for data ingestion into the big data platform using Spark Streaming and Kinesis.
  • Worked in an Agile environment, participated in design reviews and end-to-end UATs, and assisted QA in automating test cases.
  • Performed unit testing, UAT, and end-to-end automation design reviews.
  • Acted as Lead to migrate On-Prem Big Data infrastructure to AWS Cloud.
  • Designed and documented the solution architecture diagram for the data flow and the services to be used in the cloud.
  • Used AWS Glue for data transformation, validation, and cleansing. Used Python's boto3 to configure AWS Glue, EC2, and S3.
  • Developed SnowSQL scripts to load the final core tables from the stage tables.
  • Created external tables with partitions using Hive, AWS Athena, and Redshift.
  • Involved in the complete big data flow of the application, from ingesting upstream data into HDFS to processing and analyzing it.
  • Designed and implemented a real-time data pipeline to process semi-structured data, integrating 150 million raw records from 30+ data sources using Kafka and PySpark and storing the processed data in Redshift.
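A minimal sketch of the kind of PySpark job referenced above for reading raw JSON and writing a partitioned Delta table; the bucket paths, column names, and deduplication key are illustrative assumptions, and Delta support is assumed to be available as it is on Databricks:

    # Read raw JSON from S3, apply light transformations, and write the result
    # as a partitioned Delta table.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("hr-events-etl").getOrCreate()

    raw = (spark.read
           .option("multiLine", "true")            # nested JSON often arrives multi-line
           .json("s3://example-raw-bucket/hr_events/"))

    cleaned = (raw
               .dropDuplicates(["event_id"])                     # assumed key column
               .withColumn("event_date", F.to_date("event_ts"))  # derive partition column
               .filter(F.col("event_type").isNotNull()))

    (cleaned.write
     .format("delta")
     .mode("overwrite")
     .partitionBy("event_date")
     .save("s3://example-curated-bucket/hr_events_delta/"))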

Environment: AWS Databricks, Hive, AWS Redshift, AWS S3, Apache Airflow, Scala, Python, PySpark, MS SQL, Kubernetes, AWS Lambda, AWS CloudWatch, Alteryx, Pentaho ETL, DBeaver, Trino/Presto clusters, REST API calls, Kinesis Streaming, Workday, Kronos, Kissflow, Zeppelin, Zendesk API, Vault management, Databricks Secrets, AWS CLI

Confidential, Milwaukee, WI

Senior AWS Data Engineer

Responsibilities:

  • Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch.
  • Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets. Created a Lambda deployment function and configured it to receive events from an S3 bucket.
  • Designed the data models used in data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definition of key business elements.
  • Used AWS Glue for data transformation, validation, and cleansing. Used Python's boto3 to configure AWS Glue, EC2, and S3 (see the sketch after this list).
  • Designed and implemented a real-time data pipeline to process semi-structured data, integrating 150 million raw records from 30+ data sources using Kafka and PySpark and storing the processed data in Redshift.
  • Created Databricks notebooks using SQL and Python, and automated them using Databricks jobs.
  • Created pipelines, data flows, and complex data transformations and manipulations using Azure Data Factory (ADF) and PySpark with Databricks. Also created and provisioned multiple Databricks clusters for batch and continuous streaming data processing and installed the required libraries on the clusters.
  • Coded the application in Scala using IntelliJ, built a JAR file with SBT, and submitted the JAR to Spark via spark-submit.
  • Used Talend to make the data available in the cloud for the offshore team. Using the last processed date as a timestamp, ran the job daily, with the automation running entirely on the YARN cluster.
  • Automated the data flow using NiFi, Accumulo, and Control-M.
  • Led the migration from Oracle to Redshift using Amazon Athena and S3, resulting in annual cost savings of $900,000 and a 14% performance increase.
  • Designed and developed ETL mappings for data collection from various data feeds using REST APIs.
  • Responsible for creating on-demand tables over S3 files using Lambda functions and AWS Glue with Python and PySpark.
  • Implemented the landing process of loading customer/product data sets from various source systems into MDM using NiFi.
  • Heavily involved in testing Snowflake to determine the best possible way to use cloud resources.
  • Developed ELT workflows using NiFi to load data into Hive and Teradata.
  • Worked on migrating jobs from the NiFi development cluster to the pre-production and production clusters.
  • Scheduled different Snowflake jobs using NiFi.
  • Used NiFi to ping Snowflake to keep the client session alive.
  • Good working experience submitting Spark jobs that report data metrics used for data quality checking.
  • Responsible for designing and configuring network subnets, route tables, association of network ACLs to subnets, and OpenVPN.
  • Responsible for account management, IAM management, and cost management.
  • Designed AWS CloudFormation templates to create VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
  • Created S3 buckets, managed their policies, and utilized S3 and Glacier for storage and backup on AWS.
  • Performed bulk loads of JSON data from S3 buckets into Snowflake. Used Snowflake functions to parse semi-structured data entirely with SQL statements.
  • Worked on SnowSQL and Snowpipe, and converted Talend joblets to support the Snowflake functionality. Created Snowpipes for continuous data loading.
  • Acted as the technical liaison between the customer and the team on all AWS technical aspects.
  • Led the installation, integration, and configuration of Jenkins CI/CD, including installation of Jenkins plugins.
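A minimal boto3 sketch of the Glue/S3 pattern referenced above: an S3-triggered, Lambda-style handler that starts a Glue job for a newly landed object and reports its run state. The job name, region, and argument key are assumptions rather than values from the original project:

    # Start an AWS Glue job when a new object lands in S3 and return its state.
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    def handler(event, context):
        # Standard S3 put-event shape: pull out the bucket and object key.
        record = event["Records"][0]["s3"]
        bucket, key = record["bucket"]["name"], record["object"]["key"]

        # Start the (hypothetical) Glue job with the new object as an argument.
        run = glue.start_job_run(
            JobName="curate-landing-data",
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )

        # Look up the run so the caller and CloudWatch logs show what was triggered.
        status = glue.get_job_run(JobName="curate-landing-data", RunId=run["JobRunId"])
        return {"run_id": run["JobRunId"], "state": status["JobRun"]["JobRunState"]}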

Environment: AWS, Hive, Informatica, Talend, AWS Redshift, AWS S3, Apache Airflow, Control-M, Databricks, Apache NiFi, Python, MS SQL, Amazon Web Services, Snowflake, DataStage 11.5

Confidential, Virginia Beach, VA

Data Engineer

Responsibilities:

  • Wrote MapReduce code in Python to remove certain security issues from the data.
  • Synchronized both unstructured and structured data using Pig and Hive from a business perspective.
  • Gathered business requirements, defined and designed the data sourcing, and worked with the data warehouse architect on the development of logical data models.
  • Used Pig Latin on the client-side cluster and HiveQL on the server-side cluster.
  • Imported the complete data set from the RDBMS into the HDFS cluster using Sqoop.
  • Created external tables and moved data into them from managed tables.
  • Performed subqueries in Hive.
  • Solid experience integrating various data sources with multiple databases such as Teradata, Netezza, Oracle 11g, MS SQL Server, and COBOL/DB2/VSAM, along with JCL, PROCs, the ENDEVOR package, the CA7 scheduler, IBM Cognos, and Tableau.
  • Partitioned and bucketed the imported data using HiveQL.
  • Implemented a batch process for heavy-volume data loading using the Apache NiFi dataflow framework, within an Agile development methodology.
  • Created several types of data visualizations using Python and Tableau. Extracted large volumes of data from AWS using SQL queries to create reports.
  • Developed real-time streaming applications integrated with Kafka and NiFi to handle high-volume, high-velocity data streams in a scalable, reliable, and fault-tolerant manner for Confidential campaign management analytics.
  • Partitioned data dynamically using Hive's dynamic partition insert feature (see the sketch after this list).
  • Moved the partitioned data into different tables per business requirements.
  • Invoked external UDF/UDAF/UDTF Python scripts from Hive using the Hadoop Streaming approach, with cluster monitoring via Ganglia.
  • Automated ETL processes across billions of rows of data, reducing the monthly manual workload by 33%.
  • Used the AWS Glue catalog with crawlers to catalog data in S3 and perform SQL query operations. Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Deployed the big data Hadoop application using Talend on AWS (Amazon Web Services) and on Microsoft Azure.
  • Implemented a data lake to consolidate data from multiple source databases such as Exadata and Teradata using Hadoop stack technologies (Sqoop, Hive/HQL).
  • Involved in designing and developing enhancements to product features.
  • Involved in designing and developing enhancements to CSG using AWS APIs.
  • Enhanced the existing product with new features such as user roles (Lead, Admin, Developer), ELB, Auto Scaling, S3, CloudWatch, CloudTrail, and RDS scheduling.
  • Extracted data from the data warehouse (Teradata) into Spark RDDs.
  • Experience with Spark using Scala/Python.
  • Worked on stateful transformations in Spark Streaming.
  • Good hands-on experience loading data into Hive from Spark RDDs.
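A minimal sketch of the dynamic partition insert pattern referenced above, issued through Spark's Hive support; the database, table, and column names are illustrative assumptions:

    # Move rows from a managed staging table into a partitioned table using
    # Hive dynamic partitioning (the partition column goes last in the SELECT).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-dynamic-partition-load")
             .enableHiveSupport()
             .getOrCreate())

    # Hive settings that allow fully dynamic partitioning.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    spark.sql("""
        INSERT OVERWRITE TABLE analytics.orders_partitioned
        PARTITION (load_date)
        SELECT order_id, customer_id, amount, load_date
        FROM staging.orders_managed
    """)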

Environment: HDFS cluster, Hive, Apache NiFi, Pig, Sqoop, Oozie, MapReduce, Talend, Python.

Confidential

Data Analyst

Responsibilities:

  • Worked with the Google Analytics API and the Salesforce API using Python to create data views for use in BI tools such as Tableau.
  • Worked with two different data sets, one using HiveQL and the other using Pig Latin.
  • Experience moving raw data between different systems using Apache NiFi.
  • Automated the data flow process using NiFi.
  • Hands-on experience tracking data flow in real time using NiFi.
  • Implemented a data lake to consolidate data from multiple source databases such as Exadata and Teradata using Hadoop stack technologies (Sqoop, Hive/HQL).
  • Developed real-time streaming applications integrated with Kafka and NiFi to handle high-volume, high-velocity data streams in a scalable, reliable, and fault-tolerant manner for Confidential campaign management analytics.
  • Partitioned data dynamically using Hive's dynamic partition insert feature.
  • Moved the partitioned data into different tables per business requirements.
  • Invoked external UDF/UDAF/UDTF Python scripts from Hive using the Hadoop Streaming approach, with cluster monitoring via Ganglia.
  • Automated ETL processes across billions of rows of data, reducing the monthly manual workload by 33%.
  • Used the AWS Glue catalog with crawlers to catalog data in S3 and perform SQL query operations. Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Deployed the big data Hadoop application using Talend on AWS (Amazon Web Services) and on Microsoft Azure.
  • Involved in designing and developing enhancements to product features.
  • Involved in designing and developing enhancements to CSG using AWS APIs.
  • Enhanced the existing product with new features such as user roles (Lead, Admin, Developer), ELB, Auto Scaling, S3, CloudWatch, CloudTrail, and RDS scheduling.
  • Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
  • Involved in designing the SRS with activity flow diagrams using UML.
  • Employed Agile methodology for project management, including tracking project milestones, gathering project requirements and technical closures, planning and estimating project effort, creating key project-related design documents, and identifying technology-related risks and issues.
  • Implemented a batch process for heavy-volume data loading using the Apache NiFi dataflow framework, within an Agile development methodology.
  • Experience with Spark using Scala/Python.
  • Worked on stateful transformations in Spark Streaming.
  • Good hands-on experience loading data into Hive from Spark RDDs.
  • Worked on Spark SQL UDFs and Hive UDFs.
  • Worked with Spark accumulators and broadcast variables (see the sketch below).
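A minimal PySpark sketch combining the two items above: a Python function registered both as a DataFrame UDF and a Spark SQL UDF, using a broadcast lookup table and an accumulator. The state-to-region mapping and column names are illustrative assumptions:

    # Broadcast a small reference map to executors, count unknown states with an
    # accumulator, and expose the lookup as both a DataFrame and a SQL UDF.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-broadcast-demo").getOrCreate()
    sc = spark.sparkContext

    region_lookup = sc.broadcast({"NY": "East", "WI": "Midwest", "VA": "East"})
    bad_rows = sc.accumulator(0)   # best-effort count; task retries may double-count

    def to_region(state):
        region = region_lookup.value.get(state)
        if region is None:
            bad_rows.add(1)
        return region or "Unknown"

    # Register the same function for the DataFrame API and for Spark SQL.
    to_region_udf = F.udf(to_region, StringType())
    spark.udf.register("to_region", to_region, StringType())

    df = spark.createDataFrame([("a1", "NY"), ("a2", "TX")], ["id", "state"])
    df.withColumn("region", to_region_udf("state")).show()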

Environment: SAS, SQL, Teradata, Oracle, PL/SQL, UNIX, XML, Python, AWS, SSRS, TSQL, Hive, Sqoop
