Senior Big Data Engineer Resume

Chicago, Illinois

SUMMARY

  • Over 8 years of experience as a Data Engineer, with deep expertise in statistical data analysis: transforming business requirements into analytical models, designing algorithms, and building strategic solutions that scale across massive volumes of data.
  • Excellent experience in designing, developing, documenting, and testing ETL jobs and mappings in Server and Parallel jobs using DataStage to populate tables in data warehouses and data marts.
  • Experience in Apache Spark, Spark Streaming, Spark SQL, and NoSQL databases like HBase, Cassandra, and MongoDB.
  • Established and executed a Data Quality Governance Framework, including an end-to-end process and data quality framework for assessing decisions that ensure the suitability of data for its intended purpose.
  • Good experience with Amazon Web Services (AWS) such as EMR and EC2, which provide fast and efficient processing for Teradata big data analytics.
  • Big Data/Hadoop, data analysis, and data modeling professional with applied information technology experience.
  • Strong experience working with HDFS, MapReduce, Spark, Hive, Sqoop, Flume, Kafka, Oozie, Pig and HBase.
  • Experience with Hadoop distributions such as Cloudera and Hortonworks.
  • Excellent knowledge of analyzing data dependencies using metadata stored in the repository and preparing batches for existing sessions to facilitate scheduling of multiple sessions.
  • Utilized analytical applications like SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
  • Good knowledge in Database Creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase and SQL Server databases.
  • Deep understanding of MapReduce with Hadoop and Spark. Good knowledge of the Big Data ecosystem, including Hadoop 2.0 (HDFS, Hive, Pig, Impala) and Spark (Spark SQL, Spark MLlib, Spark Streaming).
  • Experienced in writing complex SQL queries, stored procedures, triggers, joins, and subqueries.
  • Interpret business problems and provide solutions using data analysis, data mining, optimization tools, machine learning techniques, and statistics.
  • Built and supported large-scale Hadoop environments, including design, configuration, installation, performance tuning, and monitoring.
  • Experienced in fact/dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
  • Extensive experience in loading and analyzing large datasets with Hadoop framework (MapReduce, HDFS, PIG, HIVE, Flume, Sqoop, SPARK, Impala, Scala), NoSQL databases like MongoDB, HBase, Cassandra.
  • Integrated Kafka with Spark Streaming for real-time data processing.
  • Skilled in data parsing, manipulation, and preparation, including describing data contents.
  • Strong experience in the analysis, design, development, testing, and implementation of Business Intelligence solutions using data warehouse/data mart design, ETL, BI, and client/server applications, and in writing ETL scripts using regular expressions and custom tools (Informatica, Pentaho, and Syncsort).
  • Experienced with the Hadoop ecosystem and Big Data components including Apache Spark, Scala, Python, HDFS, MapReduce, and Kafka.
  • Expert in designing Server jobs using various types of stages like Sequential file, ODBC, Hashed file, Aggregator, Transformer, Sort, Link Partitioner and Link Collector.
  • Proficiency in Big Data Practices and Technologies like HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Oozie, Flume, Spark, Kafka.
  • Experience in designing & developing applications using Big Data technologies HDFS, Map Reduce, Sqoop, Hive, PySpark & Spark SQL, HBase, Python, Snowflake, S3 storage, Airflow.
  • Experience in performance tuning MapReduce jobs and complex Hive queries.
  • Experience in building efficient ETL using Spark in-memory processing, Spark SQL, and Spark Streaming with the Kafka distributed messaging system.
  • Understanding of structured data sets, data pipelines, ETL tools, and data reduction, transformation, and aggregation techniques; knowledge of tools such as dbt and DataStage.
  • Good knowledge of job orchestration tools like Oozie, ZooKeeper, and Airflow.
  • Wrote PySpark jobs in AWS Glue to merge data from multiple tables and used Crawlers to populate the AWS Glue Data Catalog with metadata table definitions (see the sketch after this list).
  • Generated scripts in AWS Glue to transfer data and used AWS Glue to run ETL jobs and aggregations with PySpark code.
  • Excellent performance in building and publishing customized interactive reports and dashboards with custom parameters and user filters, producing tables, graphs, and listings using Tableau.
  • Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements. Practical understanding of data modeling concepts (dimensional and relational) such as star-schema modeling, snowflake schema modeling, and fact and dimension tables.
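
A minimal sketch of the kind of AWS Glue PySpark merge job described above; the database, table, key, and bucket names are hypothetical placeholders, not values taken from this resume.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import Join
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue job boilerplate: resolve the job name and initialize contexts.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read two Data Catalog tables (hypothetical names) that a Crawler populated.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders")
    customers = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="customers")

    # Merge on the shared key and write the result back to S3 as Parquet.
    merged = Join.apply(orders, customers, "customer_id", "customer_id")
    glue_context.write_dynamic_frame.from_options(
        frame=merged,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/orders_customers/"},
        format="parquet",
    )
    job.commit()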

TECHNICAL SKILLS

Big Data: Cloudera Distribution, HDFS, Yarn, Data Node, Name Node, Resource Manager, Node Manager, MapReduce, PIG, SQOOP, Kafka, HBase, Hive, Flume, Cassandra, Spark, Storm, Scala, Impala

Programming: Python, PySpark, Scala, Java, C, C++, Shell script, Perl script, SQL, PL/SQL

Databases: Snowflake (cloud), Teradata, IBM DB2, Oracle, SQL Server, MySQL, NoSQL

Cloud Technologies: AWS, Microsoft Azure

Frameworks: Django REST framework, MVC, Hortonworks

ETL/Reporting: Ab Initio GDE 3.0, CO>OP 2.15/3.0.3, Informatica, Tableau

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistance, Postman

Database Modelling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling

Visualization/ Reporting: Tableau, ggplot2, matplotlib, SSRS and Power BI

Web/App Server: UNIX server, Apache Tomcat

Operating System: UNIX, Windows, Linux, Sun Solaris

Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Associative rules, NLP and Clustering.

PROFESSIONAL EXPERIENCE

Confidential, Chicago Illinois

Senior Big Data Engineer

Responsibilities:

  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Used SSIS to build automated multi-dimensional cubes.
  • Used Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS using Python and NoSQL databases such as HBase and Cassandra (see the streaming sketch after this list).
  • Developed a Spark Streaming application to read raw packet data from Kafka topics, format it as JSON, and push it back to Kafka for future use cases.
  • Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
  • Identified and documented Functional/Non-Functional and other related business decisions for implementing Actimize-SAM to comply with AML Regulations.
  • Worked with regional and country AML Compliance leads to support the start-up of compliance-led projects at regional and country levels, including defining the subsequent phases: training, UAT, staffing to perform test scripts, data migration, the uplift strategy (updating customer information to bring customers to the new KYC standards), and review of customer documentation.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Created data pipelines for gathering, cleaning, and optimizing data using Hive and Spark.
  • Gathered data stored in AWS S3 from various third-party vendors, optimized it, and joined it with internal datasets to derive meaningful information.
  • Combined various datasets in Hive to generate business reports.
  • Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Prepared and uploaded SSRS reports; managed database and SSRS permissions.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various AML modules including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
  • Used SQL Server Management Studio to check the data in the database against the given requirements.
  • Played a lead role in gathering requirements, analysis of entire system and providing estimation on development, testing efforts.
  • Involved in designing different components of the system, including Sqoop, Hadoop processing with MapReduce and Hive, Spark, and FTP integration to downstream systems.
  • Wrote optimized Hive and Spark queries using techniques such as window functions and customized Hadoop shuffle and sort parameters.
  • Developed ETL jobs using PySpark, using both the DataFrame API and the Spark SQL API.
  • Using Spark, performed various transformations and actions; the final result data was saved back to HDFS and from there into the target Snowflake database.
  • Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for processing and storage of small data sets, and maintained the Hadoop cluster on AWS EMR.
  • Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume; configured Spark Streaming to get ongoing information from Kafka and store the streamed information in HDFS.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/Text files) into AWS Redshift.
  • Used various Spark transformations and actions to cleanse the input data.
  • Used Jira for ticketing and tracking issues and Jenkins for continuous integration and continuous deployment.
  • Enforced standards and best practices around the data catalog and data governance efforts.
  • Created DataStage jobs using different stages like Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, Row Generator, etc.
  • Expertise in creating, debugging, scheduling, and monitoring Airflow jobs for ETL batch processing that loads into Snowflake for analytical processes.
  • Built ETL pipelines for data ingestion, transformation, and validation on AWS, working alongside data stewards under data compliance requirements.
  • Scheduled jobs using Airflow scripts in Python, adding different tasks to DAGs and Lambda.
  • Used PySpark for extracting, filtering, and transforming data in data pipelines.
  • Skilled in monitoring servers using Nagios and CloudWatch and using the ELK Stack (Elasticsearch and Kibana).
  • Used dbt (data build tool) for transformations in the ETL process, along with AWS Lambda and AWS SQS.
  • Scheduled all jobs using Airflow scripts in Python, adding different tasks to DAGs and defining dependencies between the tasks.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Created Unix Shell scripts to automate the data load processes to the target Data Warehouse.
  • Responsible for implementing monitoring solutions in Ansible, Terraform, Docker, and Jenkins.
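
A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS flow referenced above; the broker address, topic, schema, and paths are hypothetical placeholders, and the job assumes the spark-sql-kafka package is available at submit time.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

    # Hypothetical event schema; the real topics and fields are not given here.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("payload", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Subscribe to a Kafka topic and parse the JSON value column.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "raw_events")
           .load())
    events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
                 .select("e.*"))

    # Persist the parsed stream to HDFS as Parquet with checkpointing.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/raw_events")
             .option("checkpointLocation", "hdfs:///checkpoints/raw_events")
             .start())
    query.awaitTermination()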

Environment: Hadoop, NiFi, Pig, Hive, Cloudera Manager (CDH5), S3, Kafka, Scrum, Git, Sqoop, Oozie, PySpark, Informatica, Tableau, OLTP, OLAP, HBase, Python, Shell, XML, Unix, Snowflake, Cassandra, and SQL Server.

Confidential, Saint Louis, MO

Azure Data Developer

Responsibilities:

  • Used Azure Data Factory extensively for ingesting data from disparate source systems.
  • Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems. Automated jobs using different triggers (Event, Scheduled and Tumbling) in ADF.
  • Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.
  • Designed and developed user defined functions, stored procedures, triggers for Cosmos DB.
  • Analyzed the data flow from different sources to target to provide the corresponding design Architecture in Azure environment.
  • Took initiative and ownership to provide business solutions on time.
  • Created High level technical design documents and Application design documents as per the requirements and delivered clear, well-communicated and complete design documents.
  • Created DA specs and Mapping Data flow and provided the details to developer along with HLDs.
  • Created Build definition and Release definition for Continuous Integration (CI) and Continuous Deployment (CD).
  • Created an Application Interface Document for downstream teams to build a new interface to transfer and receive files through Azure Data Share.
  • Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks (see the sketch after this list).
  • Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
  • Created and provisioned the different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.
  • Integrated Azure Active Directory authentication into every Cosmos DB request and demoed the feature to stakeholders.
  • Improved performance by optimizing compute time to process the streaming data and saved costs for the company by optimizing cluster run time.
  • Performed ongoing monitoring, automation, and refinement of data engineering solutions; prepared complex SQL views and stored procedures in Azure SQL Data Warehouse and Hyperscale.
  • Designed and developed a new solution to process NRT data using Azure Stream Analytics, Azure Event Hub, and Service Bus Queue. Created a Linked Service to land the data from an SFTP location to Azure Data Lake.
  • Created numerous pipelines in Azure using Azure Data Factory v2 to get data from disparate source systems using different Azure activities like Move & Transform, Copy, Filter, ForEach, Databricks, etc.
  • Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
  • Extensively used SQL Server Import and Export Data tool.
  • Created database users, logins, and permissions during setup.
  • Worked with complex SQL, stored procedures, triggers, and packages in large databases across various servers.
  • Helped team members resolve technical issues; handled troubleshooting, project risk and issue identification and management, resource issues, monthly one-on-ones, and weekly meetings.
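
A minimal Databricks PySpark sketch of the kind of lake transformation described above; the storage account, container, and column names are hypothetical, and `spark` is the session Databricks provides in a notebook or job.

    from pyspark.sql.functions import current_date

    # Read raw Parquet files landed in Azure Data Lake Storage Gen2 (hypothetical paths).
    raw = (spark.read
           .format("parquet")
           .load("abfss://raw@examplestorage.dfs.core.windows.net/sales/orders/"))

    # Basic cleanup: drop duplicate orders and stamp the ingestion date.
    curated = (raw.dropDuplicates(["order_id"])
                  .withColumn("ingest_date", current_date()))

    # Write the curated data back to the lake as a Delta table for downstream consumers.
    (curated.write
            .format("delta")
            .mode("overwrite")
            .save("abfss://curated@examplestorage.dfs.core.windows.net/sales/orders/"))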

Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, SQL Server, Teradata Utilities, Windows Remote Desktop, UNIX Shell Scripting, Azure PowerShell, Databricks, Python, Erwin Data Modeling Tool, Azure Cosmos DB, Azure Stream Analytics, Azure Event Hub, Azure Machine Learning.

Confidential, San Francisco, CA

Data Engineer

Responsibilities:

  • Migrated data from on-premises systems to AWS storage buckets.
  • Developed a Python script to transfer data from on-premises systems to AWS S3.
  • Developed a Python script to call REST APIs and extract data to AWS S3.
  • Worked on Ingesting data by going through cleansing and transformations and leveraging AWS Lambda, AWS Glue and Step Functions.
  • Created YAML files for each data source, including Glue table stack creation.
  • Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Developed Lambda functions and assigned IAM roles to run Python scripts along with various triggers (SQS, EventBridge, SNS).
  • Created a Lambda deployment function and configured it to receive events from S3 buckets (see the sketch after this list).
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
  • Developed Mappings using Transformations like Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.
  • Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
  • Used Sqoop to channel data between HDFS and different RDBMS sources.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Used Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS using Python and NoSQL databases such as HBase and Cassandra.
  • Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Used Apache NiFi to copy data from local file system to HDP.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Automated data processing with Oozie, including data loading into the Hadoop Distributed File System (HDFS).
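
A minimal sketch of an S3-triggered Lambda handler along the lines described above; the bucket layout and the "processed/" prefix are hypothetical illustrations, not details from this resume.

    import json

    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Copy each newly created S3 object into a processed/ prefix (hypothetical layout).
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            s3.copy_object(
                Bucket=bucket,
                Key=f"processed/{key}",
                CopySource={"Bucket": bucket, "Key": key},
            )
        return {"statusCode": 200, "body": json.dumps("processed")}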

Environment: Big Data 3.0, Hadoop 3.0, Oracle 12c, PL/SQL, Scala, Spark SQL, PySpark, Python, Kafka 1.1, SAS, Azure SQL, MDM, Oozie 4.3, SSIS, T-SQL, ETL, HDFS, Cosmos, Pig 0.17, Sqoop 1.4, MS Access.

Confidential

ETL Developer

Responsibilities:

  • Ingested data from various data sources into Hadoop HDFS/Hive Tables using SQOOP, Flume, Kafka.
  • Extended Hive core functionality by writing custom UDFs using Java.
  • Worked on multiple POCs implementing a data lake for multiple data sources ranging from Teamcenter, SAP, and Workday to machine logs.
  • Experience in Azure Cloud, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos NoSQL DB, Azure HDInsight Big Data technologies (Hadoop and Apache Spark), and Databricks.
  • Designed and developed a new solution to process the NRT data by using Azure stream analytics, Azure Event Hub, and Service Bus Queue.
  • Created Linked service to land the data from Caesars SFTP location to Azure Data Lake.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Worked on MS SQL Server PDW migration for MSBI warehouse.
  • Planned, scheduled, and implemented Oracle to MS SQL Server migrations for AMAT in-house applications and tools.
  • Integrated Tableau with the Hadoop data source to build dashboards providing various insights on the organization's sales.
  • Worked on Spark in building BI reports using Tableau; Tableau was integrated with Spark using Spark SQL.
  • Developed Spark jobs using Scala and Python on top of YARN/MRv2 for interactive and batch analysis (see the sketch after this list).
  • Developed workflows in LiveCompare to analyze SAP data and reporting.
  • Worked on Java development and support and on tools support for in-house applications.
  • Developed a multitude of dashboards with Power BI depicting differing KPIs for business analysis per business requirements.
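
A minimal PySpark batch-analysis sketch of the kind of Spark-on-YARN job mentioned above; the Hive database, table, and column names are hypothetical. A job like this would typically be submitted with spark-submit --master yarn.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hive support lets the job read and write managed Hive tables directly.
    spark = (SparkSession.builder
             .appName("machine_log_daily_rollup")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical source table holding raw machine log events.
    logs = spark.table("datalake.machine_logs")

    # Daily per-machine rollup: event counts and distinct error codes.
    daily = (logs.groupBy(F.to_date("event_ts").alias("event_date"), "machine_id")
                 .agg(F.count("*").alias("event_count"),
                      F.countDistinct("error_code").alias("distinct_errors")))

    daily.write.mode("overwrite").saveAsTable("analytics.machine_log_daily")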

Environment: Hadoop, Map Reduce, Hive, Azure, SQL, PL/SQL, T/SQL, XML, Informatica, Python, Tableau, OLAP, SSIS, SSRS, Excel, OLTP, Git.

Confidential

SQL Developer

Responsibilities:

  • Implemented the application using Agile methodology. Involved in daily scrum and sprint planning meetings.
  • Actively involved in analysis, detail design, development, bug fixing and enhancement.
  • Drove the technical design of the application by collecting requirements from the functional unit during the design phase of the SDLC.
  • Developed microservices using RESTful services to provide all CRUD capabilities.
  • Created requirement documents and designed requirements for new enhancements using UML diagrams, class diagrams, and use case diagrams.
  • Used the JBoss application server for deployment of applications.
  • Developed communication among SOA services.
  • Involved in creation of both service and client code for JAX-WS and used SOAPUI to generate proxy code from the WSDL to consume the remote service.
  • Designed the user interface of the application using HTML5, CSS3, JavaScript, Angular JS, and AJAX.
  • Worked with Session Factory, ORM mapping, Transactions and HQL in Hibernate framework.
  • Used RESTful web services for sending data to and receiving data from different applications.
  • Wrote client-side and server-side validations using JavaScript.
  • Wrote stored procedures and complex SQL queries for backend operations with the database.
  • Devised logging mechanism using Log4j.
  • Used GitHub as the version control system.
  • Created tracking sheets for tasks and generated timely reports on task progress.

Environment: Java, J2EE, Java Swing, HTML, JavaScript, Angular JS, Node.JS, JDBC, JSP, Servlet, UML, Hibernate, XML, JBoss, SDLC methodologies, Log4j, GitHub, RESTful, JAX-RS, JAX-WS, Eclipse IDE, SQL Server 2012/2014, T-SQL, SQL Profiler, DTA, ETL, SSIS, SSRS, SSMS, SSDT.
