
Senior Big Data Engineer Resume


Urbandale, IA

SUMMARY

  • 8+ years of IT development experience, including the Big Data ecosystem and related technologies. Expertise in Business Intelligence, data warehousing, ETL and Big Data technologies.
  • Experience in writing PL/SQL statements - stored procedures, functions, triggers and packages.
  • Good understanding of distributed systems, HDFS architecture, Internal working details of MapReduce and Spark processing frameworks.
  • Expertise in writing complex SQL queries; made use of indexing, aggregation and materialized views to optimize query performance.
  • Strong experience in writing scripts using Python API, PySpark API and Spark API for analyzing the data.
  • Expertise in using major Hadoop ecosystem components like HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper and Hue.
  • Expertise in the Amazon Web Services (AWS) Cloud Platform, including services like EC2, S3, VPC, IAM, DynamoDB, CloudFront, CloudWatch, Auto Scaling and Security Groups.
  • Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
  • Experienced in big data analysis and developing data models using Hive, Pig, MapReduce and SQL, with strong data architecture skills for designing data-centric solutions.
  • Excellent experience in designing and developing Enterprise Applications for J2EE platform using Servlets, JSP, Struts, Spring, Hibernate and Web services.
  • Extensive knowledge of various reporting objects like facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups and parameters in Tableau. Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Worked on Spark SQL; created DataFrames by loading data from Hive tables, prepared the data and stored it in AWS S3 (see the sketch after this list).
  • Experience in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Working experience in developing applications involving Big Data technologies like MapReduce, HDFS, Hive, Sqoop, Pig, Oozie, HBase, NiFi, Spark, Scala, Kafka and ZooKeeper, and ETL (DataStage).
  • Expertise with Python, Scala and Java in the design, development, administration and support of large-scale distributed systems.
  • Experience in using build/deploy tools such as Jenkins, Docker and OpenShift for Continuous Integration & Deployment of microservices.
  • Used the Databricks XML plug-in to parse incoming data in XML format and generate the required XML output.
  • Proficient in big data tools like Hive and Spark and the relational data warehouse tool Teradata.
  • Developed custom Kafka producers and consumers for publishing and subscribing to different Kafka topics.
  • Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills. Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
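A minimal PySpark sketch of the Hive-to-S3 preparation flow mentioned above; the table name, bucket and selected columns are illustrative placeholders, not the actual project objects.

    # Load a Hive table into a DataFrame, prepare it and write the result to S3.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-s3-prep")
             .enableHiveSupport()          # allows Spark SQL to read Hive tables
             .getOrCreate())

    orders = spark.table("sales.orders")   # hypothetical Hive table

    prep = (orders
            .filter(F.col("status") == "COMPLETE")
            .withColumn("order_date", F.to_date("order_ts"))
            .select("order_id", "customer_id", "order_date", "amount"))

    prep.write.mode("overwrite").parquet("s3a://analytics-bucket/prep/orders/")  # placeholder bucket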

TECHNICAL SKILLS

Operating Systems: Unix, Linux, Windows

Programming Languages: Java, Python 3, Scala 2.12.8, PySpark, C, C++

Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase

Cluster Management & Monitoring: CDH, Hortonworks Ambari

Databases: MySQL, SQL Server, Oracle 12c, MS Access

NoSQL Databases: MongoDB, Cassandra, HBase, KairosDB

Workflow management tools: Oozie, Apache Airflow

Visualization & ETL tools: Tableau, BananaUI, D3.js, Informatica, Talend

Cloud Technologies: Azure, AWS

IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ

Version Control Systems: Git, SVN

PROFESSIONAL EXPERIENCE

Confidential, Urbandale, IA

Senior Big Data Engineer

Responsibilities:

  • Developed Python scripts for collecting Redshift CloudWatch metrics and automated loading the data points into the Redshift database.
  • Prepared and uploaded SSRS reports; managed database and SSRS permissions.
  • Developed solutions leveraging ETL tools and identified opportunities for process improvements using Informatica and Python.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Using AWS Redshift, extracted, transformed and loaded data from various heterogeneous data sources and destinations.
  • Performed performance tuning, code promotion and testing of application changes.
  • Conducted Exploratory Data Analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlation between features.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various AML modules including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD and EDD.
  • Experience in building real-time data pipelines with Kafka Connect and Spark Streaming.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python, and into NoSQL databases such as HBase and Cassandra (see the streaming sketch after this list).
  • Administered Tableau server including creating User Rights Matrix for permissions and roles, monitoring report usage and creating sites for various departments
  • Used SQL Server management tools to check the data in the database against the given requirements.
  • Collected data using Spark Streaming from an AWS S3 bucket in near-real-time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Developed scripts for loading application call logs to S3 and used AWS Glue ETL to load them into Redshift for the data analytics team.
  • Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MLOAD, BTEQ and FastLoad.
  • Used Oozie to automate data loading into the Hadoop Distributed File System.
  • Developed automated regression scripts in Python for validation of ETL processes between multiple databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL).
  • Designed and implemented multiple ETL solutions for various data sources using extensive SQL scripting, ETL tools, Python, shell scripting and scheduling tools. Performed data profiling and data wrangling of XML, web feeds and files using Python, Unix and SQL.
  • Loaded data from different sources into a data warehouse and performed data aggregations for business intelligence using Python.
  • Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau via HiveServer2 to generate interactive reports.
  • Used Sqoop to channel data between HDFS and RDBMS sources.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Installed, configured and maintained data pipelines.
  • Utilized Ansible playbooks for code pipeline deployment.
  • Used Kafka and Kafka brokers, initiated the Spark context and processed live streaming information with RDDs; used Kafka to load data into HDFS and NoSQL databases.
  • Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling and all cleaning and conforming tasks (a UDF sketch also follows this list).
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
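A minimal sketch of the Kafka-to-HDFS flow described above, using PySpark Structured Streaming; the broker address, topic name and output paths are placeholders, and the job assumes the spark-sql-kafka connector package is available on the cluster.

    # Consume a Kafka topic and persist the raw events to HDFS as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
              .option("subscribe", "app-events")                  # placeholder topic
              .option("startingOffsets", "latest")
              .load())

    # Kafka delivers key/value as binary; cast the payload to string for downstream parsing.
    parsed = events.select(col("key").cast("string"), col("value").cast("string"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/raw/app_events")               # placeholder output path
             .option("checkpointLocation", "hdfs:///checkpoints/app_events")
             .outputMode("append")
             .start())

    query.awaitTermination()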
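A minimal sketch of the kind of cleaning/conforming PySpark UDF referenced above; the column and table names are hypothetical.

    # Custom PySpark UDF that conforms a free-text column before persisting it.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("cleaning-udfs").enableHiveSupport().getOrCreate()

    @udf(returnType=StringType())
    def normalize_state(value):
        """Trim, upper-case and truncate free-text state codes to two characters."""
        if value is None:
            return None
        return value.strip().upper()[:2]

    df = spark.table("staging.customers")                          # hypothetical source table
    cleaned = df.withColumn("state_cd", normalize_state(col("state_cd")))
    cleaned.write.mode("overwrite").saveAsTable("conformed.customers")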

Environment: Cloudera Manager (CDH5), AWS, S3, EC2, Redshift, CloudWatch, Hadoop, PySpark, HDFS, NiFi, Pig, Hive, Kafka, SSIS, Snowflake, PyCharm, Scrum, Git, Sqoop, HBase, Informatica, SQL, Python, XML, Oracle, MS SQL, T-SQL, MongoDB, DB2, Tableau, Unix, Shell Scripting.

Confidential, Bridgeton, MO

Big Data Engineer

Responsibilities:

  • Work with subject matter experts and project team to identify, define, collate, document and communicate the data migration requirements.
  • Validate Sqoop jobs, Shell scripts & perform data validation to check if data is loaded correctly without any discrepancy. Perform migration and testing of static data and transaction data from one core system to another.
  • Developed Spark scripts using Python on Azure HDInsight for data aggregation and validation, and verified their performance against MR jobs (see the sketch after this list).
  • Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Worked with developer teams on NiFi workflows to pick up data from the REST API server, the data lake and the SFTP server and send it to Kafka.
  • Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
  • Implemented a data interface to get customer information using a REST API, pre-processed the data using MapReduce and stored it in HDFS (Hortonworks).
  • Extracted and restructured the data into MongoDB using import and export command line utility tool.
  • Used Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the Cosmos activity.
  • Deployed Azure Resource Manager JSON templates from PowerShell; worked on the Azure suite: Azure SQL Database, Azure Data Lake, Azure Data Factory, Azure SQL Data Warehouse and Azure Analysis Services.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Designed, set up, maintained and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse and Azure Data Factory.
  • Worked closely on the pub-sub model as part of the Lambda architecture implemented at TCF Bank.
  • Designed and implemented Spark SQL tables and Hive script jobs with Stonebranch for scheduling, and created workflows and task flows.
  • Experience in working with different join patterns and implemented both Map and Reduce Side Joins.
  • Wrote Flume configuration files for importing streaming log data into HBase with Flume.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
  • Responsible for development with the Spark Cassandra connector to load data from flat files into Cassandra for analysis.
  • Worked on Apache NiFi: executing Spark and Sqoop scripts through NiFi, creating scatter-and-gather patterns in NiFi, ingesting data from Postgres to HDFS, fetching Hive metadata and storing it in HDFS, and creating a custom NiFi processor for filtering text from flow files.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python, and into NoSQL databases such as HBase and Cassandra.
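A minimal sketch of the kind of PySpark aggregation and validation job run on HDInsight, as referenced above; the table and column names are illustrative.

    # Aggregate raw transactions by account and day, then sanity-check the result.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-aggregation").enableHiveSupport().getOrCreate()

    txns = spark.table("raw.transactions")                     # hypothetical source table

    daily = (txns
             .groupBy("account_id", F.to_date("txn_ts").alias("txn_date"))
             .agg(F.count("*").alias("txn_count"),
                  F.sum("amount").alias("total_amount")))

    # Simple validation: the aggregate must account for every source row.
    if daily.agg(F.sum("txn_count")).first()[0] != txns.count():
        raise ValueError("Row count mismatch between source and aggregate")

    daily.write.mode("overwrite").partitionBy("txn_date").saveAsTable("curated.daily_txn_summary")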

Environment: Spark, Redshift, Python, Azure, Databricks, Data Lake, Data Storage, Data Factory, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie.

Confidential, Charlotte, NC

Big Data Engineer

Responsibilities:

  • Worked on Amazon AWS concepts like EMR and EC2 web services for fast and efficient processing of Big Data.
  • Generated parameterized queries for tabular reports using global variables, expressions, functions and stored procedures in SSRS.
  • Designed and developed PL/SQL procedures, functions and packages to create summary tables.
  • Worked on performance tuning of the database, including indexes and optimizing SQL statements.
  • Fixed load-balancing issues of DataStage jobs and database jobs on the server.
  • Created data models for AWS Redshift and Hive from dimensional data models.
  • Involved in selecting and integrating any Big Data tools and frameworks required to provide requested capabilities.
  • Developed Pig scripts and UDFs as per the business logic.
  • Developed a new architecture for the project that uses less infrastructure and costs less, by converting the data load jobs to read directly from on-premises data sources.
  • Executed change management processes surrounding new releases of SAS functionality.
  • Prepared complex T-SQL queries, views and stored procedures to load data into the staging area.
  • Participated in data collection, data cleaning, data mining, developing models and visualizations.
  • Worked with Sqoop to transfer data between HDFS and relational databases like MySQL and vice versa, with experience in using Talend for this purpose.
  • Installed and configured Hortonworks Hadoop from scratch for development, along with Hadoop tools like Hive, HBase, Sqoop, ZooKeeper and Flume.
  • Built and deployed industrial-scale data lakes on premises and on cloud platforms.
  • Used SSIS and T-SQL stored procedures to transfer data from OLTP databases to the staging area and finally into the data mart.
  • Extracted tables and exported data from Teradata through Sqoop and placed it in Cassandra.
  • Worked on analyzing and examining customer behavioral data using MongoDB.
  • Enforced referential integrity in the OLTP data model for consistent relationships between tables and efficient database design.
  • Developed Spark and Java applications for data streaming and data transformation.
  • Monitored containers on AWS EC2 machines using the Datadog API, and ingested and enriched data into the internal cache system.
  • Developed building-block code (a common feed parser) with a driver program that can work with any Kafka topic and with data formats like XML and JSON (see the sketch after this list).
  • Designed DataStage ETL jobs for extracting data from heterogeneous source systems, transforming it and finally loading it into the data warehouse.
  • Worked as a Data Engineer: designed and modified database tables and used HBase queries to insert and fetch data from tables.
  • Created Hive external tables to stage data and then moved the data from staging to main tables.
  • Created jobs and transformations in Pentaho Data Integration to generate reports and transfer data from HBase to RDBMS.
  • Wrote DDL and DML statements for creating and altering tables and converting characters into numeric values.
  • Worked on the Master Data Management (MDM) hub and interacted with multiple stakeholders.
  • Worked on Kafka and Storm to ingest real-time data streams and push the data to HDFS or HBase as appropriate.
  • Extensively involved in development and implementation of SSIS and SSAS applications.
  • Collaborated with ETL and DBA teams to analyze and provide solutions to data issues and other challenges while implementing the OLAP model.
  • Developed data pipelines using Pig and Hive from Teradata and DB2 data sources. These pipelines had customized UDFs to extend the ETL functionality.
  • Worked with OLTP to find the daily transactions, the types of transactions that occurred and the amount of resources used.
  • Developed a conceptual model and logical model using Erwin based on requirements analysis.
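A minimal Python sketch of a common feed parser with a driver program, along the lines described above; the broker address and topic name are placeholders, and the kafka-python client is an assumed dependency.

    # Consume any Kafka topic and parse XML or JSON payloads into flat dicts.
    import json
    import xml.etree.ElementTree as ET
    from kafka import KafkaConsumer

    def parse_payload(raw: bytes) -> dict:
        """Detect the message format and return a flat dict of fields."""
        text = raw.decode("utf-8").strip()
        if text.startswith("<"):
            root = ET.fromstring(text)                 # XML feed
            return {child.tag: child.text for child in root}
        return json.loads(text)                        # JSON feed

    def run(topic: str) -> None:
        consumer = KafkaConsumer(topic,
                                 bootstrap_servers=["broker1:9092"],   # placeholder broker
                                 auto_offset_reset="earliest")
        for message in consumer:
            record = parse_payload(message.value)
            print(record)                              # downstream: enrich / persist

    if __name__ == "__main__":
        run("customer-feed")                           # hypothetical topic name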

Environment: Hadoop, HDFS, HBase, SSIS, SSAS, OLAP, Hortonworks, Data lake, OLTP, ETL, Java, ANSI-SQL, AWS, SDLC, T-SQL, SAS, MySQL, Big Integrate, HDFS, Sqoop, Cassandra, MongoDB, Hive, SQL, PL/SQL, Teradata, Oracle 11g, MDM

Confidential

Data Engineer

Responsibilities:

  • As a Data Engineer, my role included analyzing and evaluating the business rules, data sources and data volumes, and coming up with estimation, planning and an execution plan to ensure the architecture meets the business requirements.
  • Developed automated regression scripts in Python for validation of ETL processes between multiple databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL) (see the validation sketch after this list).
  • Used Oozie to automate data loading into the Hadoop Distributed File System.
  • Used Sqoop to channel data between HDFS and RDBMS sources.
  • To meet specific business requirements, wrote UDFs in Scala and stored procedures; replaced the existing MapReduce programs and Hive queries with Spark applications using Scala.
  • Created several types of data visualizations using Python and Tableau.
  • Extracted Mega Data from AWS using SQL Queries to create reports.
  • Performed Regression testing for Golden Test Cases from State (end to end test cases) and automated the process using python scripts.
  • Responsible for data cleansing from source systems using Ab Initio components such as Join, Dedup Sorted, Denormalize, Normalize, Reformat, Filter by Expression and Rollup.
  • Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis pertaining to Loan products.
  • Gathered data from the Help Desk ticketing system and wrote ad-hoc reports, charts and graphs for analysis.
  • Worked to ensure high levels of Data consistency between diverse source systems including flat files, XML and SQL Database.
  • Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple heterogeneous information sources.
  • Built reports and report models using SSRS to enable end user report builder usage.
  • Built S3 buckets and managed policies for S3 buckets and used S3 bucket and Glacier for storage and backup on AWS.
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats like text and CSV files.
  • Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Used Hive to implement a data warehouse and stored data in HDFS. Stored data in Hadoop clusters set up in AWS EMR.
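A minimal sketch of one of the cross-database validation checks described above, comparing row counts between a SQL Server source and a Redshift target; the connection details, table names and the psycopg2/pyodbc drivers are assumptions.

    # Compare source and target row counts for each migrated table.
    import psycopg2
    import pyodbc

    def redshift_count(table):
        with psycopg2.connect(host="redshift-host", port=5439, dbname="dw",
                              user="etl_user", password="***") as conn:   # placeholder credentials
            with conn.cursor() as cur:
                cur.execute(f"SELECT COUNT(*) FROM {table}")
                return cur.fetchone()[0]

    def sqlserver_count(table):
        conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                              "SERVER=sql-host;DATABASE=source;UID=etl_user;PWD=***")  # placeholder
        try:
            return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        finally:
            conn.close()

    if __name__ == "__main__":
        for src, tgt in [("dbo.orders", "public.orders")]:        # hypothetical table pairs
            source_rows, target_rows = sqlserver_count(src), redshift_count(tgt)
            status = "OK" if source_rows == target_rows else "MISMATCH"
            print(f"{src} -> {tgt}: {source_rows} vs {target_rows} [{status}]")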

Environment: Apache Spark, Hadoop, Spark-SQL, AWS, Java, Scala, MapReduce, Spark Streaming, Eclipse, Oracle, Teradata, PL/SQL, Linux Shell Scripting.

Confidential

Data Engineer

Responsibilities:

  • Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
  • Data analysis: expertise in analyzing data using Pig scripting, Hive queries, Spark (Python) and Impala.
  • Worked with developer teams on NiFi workflows to pick up data from the REST API server, the data lake and the SFTP server and send it to Kafka.
  • Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
  • Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
  • Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Pig, and Hive.
  • Built pipelines to move hashed and un-hashed data from XML files to Data lake.
  • Built large-scale data processing systems in data warehousing solutions and worked with unstructured data mining on NoSQL.
  • Specified the cluster size, resource pool allocation and Hadoop distribution by writing the specification texts in JSON file format.
  • Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment.
  • Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework (a Spark SQL sketch follows this list).
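A minimal sketch of running Hive warehouse queries through Spark SQL for the kind of aggregation described above; the database and table names are illustrative.

    # Run a Hive warehouse aggregation through Spark SQL instead of Hive-on-MapReduce.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-on-spark-sql")
             .enableHiveSupport()                      # lets Spark SQL read Hive tables
             .getOrCreate())

    summary = spark.sql("""
        SELECT region, COUNT(*) AS events, SUM(amount) AS revenue
        FROM warehouse.sales_events                    -- hypothetical Hive table
        GROUP BY region
    """)

    summary.show()
    summary.write.mode("overwrite").saveAsTable("warehouse.sales_by_region")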

Environment: Hadoop, Java, MapReduce, HBase, JSON, Spark JDBC, Hive, Pig.
