
Senior Big Data Engineer Resume


Houston, TX

SUMMARY

  • 8+ years of IT experience in analysis, design, development, implementation, maintenance, and support, with experience in developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
  • Strong working knowledge of systems that handle massive amounts of data in highly distributed mode on the Cloudera and Hortonworks Hadoop distributions and Amazon AWS.
  • Wrote complex HiveQL queries for data extraction from Hive tables and developed Hive User Defined Functions (UDFs) as required.
  • Good knowledge of Spark architecture and components; efficient in working with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
  • Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets (illustrated in a sketch after this list).
  • Worked on HBase to load and retrieve data for real-time processing using the REST API.
  • Knowledgeable in job workflow scheduling and locking tools/services such as Oozie, ZooKeeper, Airflow, and Apache NiFi.
  • Strong knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
  • Experience in configuring ZooKeeper to coordinate servers in clusters and maintain the data consistency that is critical to downstream decision making.
  • Experience in configuring Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS, with expertise in using Spark SQL with various data sources such as JSON, Parquet, and Hive (see the streaming sketch after this list).
  • Extensively used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data, and used DataFrame operations to perform the required data validations.
  • Proficient in Python scripting; worked with statistical functions in NumPy, visualization using Matplotlib, and Pandas for organizing data.
  • Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
  • Hands-on experience in using Hadoop ecosystem components such as Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, ZooKeeper, Kafka, Flume, the MapReduce framework, YARN, Scala, and Hue.
  • Experienced in working with Amazon Web Services (AWS), using EC2 for compute and S3 for storage.
  • Capable of using AWS services such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
  • Ingested data into the Snowflake cloud data warehouse using Snowpipe.
  • Extensive experience in working with micro-batching to ingest millions of files into the Snowflake cloud as they arrive in the staging area.
  • Worked on developing Impala scripts for extraction, transformation, and loading of data into the data warehouse.
  • Experience in importing and exporting data using Sqoop from HDFS to relational database systems and from relational database systems to HDFS.
  • Experienced in designing time-driven and data-driven automated workflows using Oozie.
  • Skilled in using Kerberos, Azure AD, Sentry, and Ranger for authentication and authorization.
  • Excellent knowledge of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
  • Hands-on knowledge of writing code in Scala, Core Java, and R.
  • Designed UNIX shell scripts for automating deployments and other routine tasks.
  • Extensive experience in developing Bash, T-SQL, and PL/SQL scripts.
  • Experienced in handling different file formats such as text files, Avro data files, Sequence files, XML, and JSON.
  • Proficient in relational databases such as Oracle, MySQL, and SQL Server. Extensive experience in working with and integrating NoSQL databases, including DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase.
  • Extensive knowledge of the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
  • Hands-on experience in using visualization tools such as Tableau and Power BI.
  • Knowledge of Integrated Development Environments such as Eclipse, NetBeans, IntelliJ, and STS.
  • Capable of working with SDLC, Agile, and Waterfall methodologies.
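
A minimal PySpark sketch of the Hive/SQL-to-DataFrame conversion noted above; the table and column names are hypothetical:

    # Hypothetical example: the HiveQL query
    #   SELECT region, SUM(amount) AS total FROM sales WHERE year = 2020 GROUP BY region
    # expressed as equivalent DataFrame transformations.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("hive-to-dataframe")
             .enableHiveSupport()       # read Hive tables through the metastore
             .getOrCreate())

    sales = spark.table("sales")        # assumed Hive table
    totals = (sales
              .filter(F.col("year") == 2020)
              .groupBy("region")
              .agg(F.sum("amount").alias("total")))
    totals.show()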
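
A minimal sketch of the Kafka-to-HDFS Spark Streaming pattern noted above; the broker, topic, schema, and paths are assumptions:

    # Hypothetical sketch: consume JSON events from Kafka and land them on HDFS as Parquet.
    # Requires the spark-sql-kafka connector package on the Spark classpath.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()
    schema = StructType().add("event_id", StringType()).add("amount", DoubleType())

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
           .option("subscribe", "consumer_events")              # assumed topic
           .load())

    events = (raw
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    (events.writeStream
           .format("parquet")
           .option("path", "hdfs:///data/events")               # assumed HDFS path
           .option("checkpointLocation", "hdfs:///checkpoints/events")
           .start()
           .awaitTermination())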

TECHNICAL SKILLS

Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase

Programming Languages: Java, PL/SQL, SQL, Python, Scala, PySpark, C, C++

Cluster Mgmt & Monitoring: CDH 4, CDH 5, Hortonworks Ambari 2.5

Databases: MySQL, SQL Server, Oracle, MS Access

NoSQL Databases: MongoDB, Cassandra, HBase

Workflow mgmt. tools: Oozie, Apache Airflow

Visualization & ETL tools: Tableau, BananaUI, D3.js, Informatica, Talend

Cloud Technologies: Azure, AWS

IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ

Version Control Systems: Git, SVN

Operating Systems: Unix, Linux, Windows

PROFESSIONAL EXPERIENCE

Senior Big Data Engineer

Confidential, Houston, TX

Responsibilities:

  • Responsible for developing a data pipeline using Spark, Scala, and Apache Kafka to ingest data from the CSL source and store it in a protected HDFS folder.
  • Created data pipelines for ingestion and aggregation events, loading consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as the feed for Tableau dashboards.
  • Implemented Spark Kafka streaming to pick up data from Kafka and send it to the Spark pipeline.
  • Implemented a Continuous Integration and Continuous Delivery process using GitLab along with Python and shell scripts to automate routine jobs, including synchronizing installers, configuration modules, packages, and requirements for the applications.
  • Involved in the installation of HDP Hadoop and the configuration of the cluster and ecosystem components such as Sqoop, Pig, Hive, HBase, and Oozie.
  • Worked extensively on AWS components such as Elastic MapReduce (EMR).
  • Scheduled different Snowflake jobs using NiFi.
  • Worked on Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
  • Implemented a CI/CD pipeline using Jenkins and Airflow for containers on Docker and Kubernetes.
  • Wrote TDCH scripts and used Apache NiFi to load data from mainframe DB2 to the Hadoop cluster.
  • Used Airflow for scheduling Hive, Spark, and MapReduce jobs (see the DAG sketch after this list).
  • Implemented many Kafka ingestion jobs to consume data for real-time and batch processing.
  • Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
  • Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Spark on Hadoop.
  • Worked on Apache Spark, utilizing the Spark Core, SQL, and Streaming components to support intraday and real-time data processing.
  • Planned and designed the data warehouse in a star schema; designed table structures and documented them.
  • Experience with Snowflake multi-cluster warehouses.
  • Involved in migrating objects using the custom ingestion framework from a variety of sources such as Oracle, SAP HANA, MongoDB, and Teradata.
  • Developed Python, PySpark, and Bash scripts to transform and load data across on-premise and cloud platforms.
  • Designed and implemented a Spark Streaming data ingestion framework for various data sources such as REST APIs and Kafka, using the Spark Streaming Scala API.
  • Moved data from Teradata to the Hadoop cluster using TDCH/FastExport and Apache NiFi.
  • Created an Ansible playbook to automate deploying Airflow on multiple EC2 instances.
  • Shared sample data by granting access to customers for UAT/BAT.
  • Developed Python and Bash scripts to automate tasks and provide control flow.
  • Built ETL data pipelines for data movement to S3 and then to Redshift.
  • Designed and implemented effective analytics solutions and models with Snowflake.
  • Created Amazon S3 storage for data and worked on transferring data from Kafka topics into AWS S3 storage.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Built and maintained Docker container clusters managed by Kubernetes, using Linux, Bash, Git, and Docker.
  • Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
  • Wrote AWS Lambda code in Python for nested JSON files: converting, comparing, sorting, etc. (see the Lambda sketch after this list).
  • Wrote PySpark UDFs in Hadoop to perform transformations and loads.
  • Experience in data ingestion techniques for batch and stream processing using AWS Batch, AWS Kinesis, and AWS Data Pipeline.
  • Implemented a batch process for high-volume data loading using an Apache NiFi dataflow framework in an Agile development methodology.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS using Sqoop.
  • Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers and send it to the Kafka broker.
  • Developed a Data Quality (DQ) framework using the Spark Scala API to ensure data validity and consistency for consumption by downstream applications.
  • Developed automated processes for code builds and deployments using Jenkins, Ant, Maven, Sonatype, and shell scripts.
  • Installed and configured applications such as Docker and Kubernetes for orchestration.
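
A minimal Airflow sketch of the scheduling pattern described above; the DAG id, schedule, and job commands are assumptions (import paths follow Airflow 2.x):

    # Hypothetical DAG: run a Hive load step, then a Spark aggregation, once a day.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator   # Airflow 2.x import path

    with DAG(
        dag_id="daily_consumer_feed",                  # assumed DAG name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        load_hive = BashOperator(
            task_id="load_hive_staging",
            bash_command="hive -f /opt/jobs/load_staging.hql",   # assumed script path
        )
        spark_aggregate = BashOperator(
            task_id="spark_aggregate",
            bash_command="spark-submit /opt/jobs/aggregate.py",  # assumed script path
        )
        load_hive >> spark_aggregate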
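
A hedged sketch of the Lambda-style JSON handling mentioned above; the bucket layout, key fields, and flattening logic are illustrative only:

    # Hypothetical Lambda handler: read a nested JSON file from S3, flatten and sort
    # its records, and write the processed result back to the same bucket.
    import json
    import boto3

    s3 = boto3.client("s3")

    def flatten(record, parent_key="", out=None):
        """Flatten nested dictionaries into dot-separated keys."""
        out = {} if out is None else out
        for key, value in record.items():
            full_key = f"{parent_key}.{key}" if parent_key else key
            if isinstance(value, dict):
                flatten(value, full_key, out)
            else:
                out[full_key] = value
        return out

    def lambda_handler(event, context):
        # Assumes a standard S3 put-event trigger.
        bucket = event["Records"][0]["s3"]["bucket"]["name"]
        key = event["Records"][0]["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        records = json.loads(body)                    # expected: a list of nested objects
        flat = sorted((flatten(r) for r in records),
                      key=lambda r: r.get("event_id", ""))   # assumed sort key

        s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=json.dumps(flat))
        return {"count": len(flat)}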

Environment: Hadoop MapR, Python, Spark MLlib, Hive, Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Jenkins, Docker, Kubernetes, Hue, AWS Services, Control-M, Tidal, ServiceNow, Java, Scala, Airflow, Teradata Studio, Snowflake Web UI, SnowSQL, Oracle 12c, Tableau

Big Data Engineer

Confidential, Oldsmar, FL

Responsibilities:

  • Responsible for developing a data pipeline using Spark, Scala, and Apache Kafka to ingest data from the CSL source and store it in a protected HDFS folder.
  • Implemented Copy activities and custom Azure Data Factory pipeline activities for on-cloud ETL processing.
  • Created notebooks using Databricks, Scala, and Spark, capturing data from Delta tables in the Delta Lake (see the Delta sketch after this list).
  • Implemented a CI/CD pipeline using Jenkins and Airflow for containers on Docker and Kubernetes.
  • Built and maintained Docker container clusters managed by Kubernetes, using Linux, Bash, Git, and Docker.
  • Implemented Spark Kafka streaming to pick up data from Kafka and send it to the Spark pipeline.
  • Strong skills in visualization tools: Power BI and Excel (formulas, pivot tables, charts) along with DAX commands.
  • Worked on designing and maintaining ADF pipelines with activities such as Copy, Lookup, ForEach, Get Metadata, Execute Pipeline, Stored Procedure, If Condition, Web, Wait, and Delete.
  • Developed Python, PySpark, and Bash scripts to transform and load data across on-premise and cloud platforms.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS using Sqoop.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
  • Wrote Databricks code and fully parameterized ADF pipelines for efficient code management.
  • Extensive experience with Azure Data Lake Analytics, Azure Data Lake Storage, Azure Data Factory, Azure SQL databases, and Azure SQL Data Warehouse for providing analytics and reports to improve marketing strategies.
  • Worked on Apache Spark, utilizing the Spark Core, SQL, and Streaming components to support intraday and real-time data processing.
  • Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
  • Hands-on experience developing SQL scripts for automation purposes.
  • Coordinated with business analysts and the enterprise architect to convert business requirements into technical solutions.
  • Understood the business requirements and developed common solutions that meet them.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool.
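
The Delta work above was done in Scala notebooks; a minimal PySpark equivalent for capturing data from a Delta table, with the paths and columns assumed:

    # Hypothetical Databricks-style snippet: read a Delta table, aggregate it,
    # and write the result back as another Delta table.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("delta-capture").getOrCreate()

    orders = spark.read.format("delta").load("/mnt/datalake/delta/orders")   # assumed path

    daily = (orders
             .withColumn("order_date", F.to_date("order_ts"))                # assumed column
             .groupBy("order_date")
             .agg(F.count("*").alias("order_count")))

    (daily.write
          .format("delta")
          .mode("overwrite")
          .save("/mnt/datalake/delta/daily_order_counts"))                   # assumed path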

Environment: HDFS, Spark, Scala, Apache Kafka, Azure Data Factory (ADF), Power BI Desktop/Server, Azure SQL Database, Azure Databricks, Pig, SSIS, Python, Bash, Docker, Kubernetes, Airflow, Jenkins, Linux, Git.

Big Data Engineer

Confidential, Houston, TX

Responsibilities:

  • Handled daily, weekly, and monthly data feeds from different data sources and loaded them into Hive tables.
  • Worked in an Agile development environment with two-week sprint cycles, dividing and organizing tasks; participated in daily scrums and other design-related meetings.
  • Created staging tables and developed workflows to extract data from different source systems in Hadoop and load it into these tables; the data from the staging tables is exported to a third-party system using SFTP.
  • Worked with end users on requirements gathering.
  • Added data validations and alerts for user-specific data manipulations.
  • Worked on deploying the project to the servers using Jenkins.
  • Provided post go-live support for existing implementations and resolved issues in the production environment to ensure a seamless flow of business transactions.
  • Responsible for creating detailed design and technical documents based on the business requirements.
  • Coordinated with the testing team on bug fixes and created documentation for recorded data, agent usage, and release cycle notes.
  • Analyzed the data by running Impala queries to view transaction information and validate the data.
  • Implemented Spark jobs to extract data from RDBMS systems, which reduced job processing time.
  • Used Oozie workflows to coordinate Pig and Hive scripts.
  • Coordinated with different source teams to get the list of tables and files in scope for the project.
  • Worked extensively on Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.
  • Used Sqoop to import data from RDBMS source systems and loaded the data into Hive staging and base tables.
  • Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, or to HTTP requests via Amazon API Gateway.
  • Participated in CDH updates and tested the regions once they were done.
  • Handled different file types such as JSON, XML, flat files, and CSV, using appropriate SerDes or parsing logic to load them into Hive tables.
  • Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake (see the loader sketch after this list).
  • Implemented test scripts to support test-driven development and continuous integration.
  • Used partitioning techniques for faster performance.
  • Analyzed production jobs in case of abends and fixed the issues.
  • Migrated data from AWS S3 buckets to Snowflake by writing custom read/write Snowflake utility functions in Scala.
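
A hedged Python sketch of the S3-to-Snowflake loading pattern above, using the snowflake-connector-python client and a COPY INTO from an external stage; the account, stage, and table names are placeholders:

    # Hypothetical loader: copy CSV files staged in S3 into a Snowflake table.
    # Assumes an external stage (@raw_stage) already points at the S3 bucket.
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="LOAD_USER",                 # placeholder credentials
        password="***",
        account="xy12345.us-east-1",      # placeholder account locator
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="STAGING",
    )
    try:
        cur = conn.cursor()
        cur.execute("""
            COPY INTO STAGING.CONSUMER_RESPONSES
            FROM @raw_stage/consumer/
            FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
            ON_ERROR = 'CONTINUE'
        """)
        print(cur.fetchall())             # per-file load results
    finally:
        conn.close()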

Environment: Hadoop, CDH, MapReduce, Hive, Pig, Sqoop, Java, Spark, AWS, Oozie, Python, UNIX, Jenkins, shell scripting, DB2, Oracle, Netezza

Data Engineer

Confidential

Responsibilities:

  • Created HBase tables to load large sets of structured data.
  • Developed solutions to process data into HDFS.
  • Created components such as Hive UDFs to provide functionality missing in Hive for analytics.
  • Developed scripts and batch jobs to schedule a bundle (a group of coordinators) consisting of various coordinator jobs.
  • Used different file formats such as text files, Sequence files, and Avro.
  • Provided cluster coordination services through ZooKeeper.
  • Analyzed the data using MapReduce, Pig, and Hive and produced summary results from Hadoop for downstream systems (see the Hadoop Streaming sketch after this list).
  • Worked extensively with Hive DDLs and Hive Query Language (HQL).
  • Developed UDF, UDAF, and UDTF functions and implemented them in Hive queries.
  • Implemented Sqoop for large dataset transfers between Hadoop and RDBMSs.
  • Used Pig as an ETL tool for transformations, event joins, and some pre-aggregations before storing the data in HDFS.
  • Developed a data pipeline using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
  • Used Sqoop to import and export data between HDFS and RDBMS systems.
  • Exported the analyzed data to the relational database MySQL using Sqoop for visualization and report generation.
  • Created MapReduce jobs to convert periodic XML messages into partitioned Avro data.
  • Used Sqoop widely to import data from various systems/sources (such as MySQL) into HDFS.
  • Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
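
The MapReduce analysis jobs in this project would typically have been written in Java; as an illustration, the same pattern can be sketched in Python with Hadoop Streaming (the field positions and the metric being counted are assumptions):

    # Hypothetical Hadoop Streaming job: count log records per status code.
    # Run with: hadoop jar hadoop-streaming.jar \
    #   -mapper "python summary.py map" -reducer "python summary.py reduce" ...
    import sys

    def mapper():
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 2:
                print(f"{fields[2]}\t1")        # assumes status code is the third column

    def reducer():
        current_key, count = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if key != current_key:
                if current_key is not None:
                    print(f"{current_key}\t{count}")
                current_key, count = key, 0
            count += int(value)
        if current_key is not None:
            print(f"{current_key}\t{count}")

    if __name__ == "__main__":
        mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()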

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, HBase, shell scripting, Oozie, Oracle 11g.

Data Engineer

Confidential

Responsibilities:

  • Developed complex SQL statements to extract data and packaged/encrypted the data for delivery to customers.
  • Provided business intelligence analysis to decision-makers using an interactive OLAP tool.
  • Created T-SQL statements (SELECT, INSERT, UPDATE, DELETE) and stored procedures.
  • Defined data requirements and elements used in XML transactions.
  • Created Informatica mappings using various transformations such as Joiner, Aggregator, Expression, Filter, and Update Strategy.
  • Worked to ensure high levels of data consistency between diverse source systems, including flat files, XML, and SQL databases.
  • Developed and ran ad-hoc data queries against multiple database types to identify systems of record, data inconsistencies, and data quality issues (see the sketch after this list).
  • Performed Tableau administration using Tableau admin commands.
  • Involved in defining source-to-target data mappings, business rules, and data definitions.
  • Performed metrics reporting, data mining, and trend analysis in a helpdesk environment using Access.
  • Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple heterogeneous information sources.
  • Built reports and report models using SSRS to enable end-user Report Builder usage.
  • Created Excel charts and pivot tables for ad-hoc data pulls.
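
The ad-hoc consistency checks above were largely SQL-based; as a hedged Python illustration of the same idea, a row-count comparison between a SQL Server table and the flat file that fed it (the connection string, table, and file names are placeholders):

    # Hypothetical consistency check: compare a table's row count against the
    # record count of its flat-file source.
    import csv
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sqlprod01;DATABASE=Sales;Trusted_Connection=yes;"   # placeholder server/db
    )
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM dbo.CustomerOrders")          # placeholder table
    db_count = cur.fetchone()[0]

    with open("customer_orders.csv", newline="") as f:              # placeholder source file
        file_count = sum(1 for _ in csv.reader(f)) - 1              # minus the header row

    if db_count != file_count:
        print(f"Mismatch: table has {db_count} rows, file has {file_count} records")
    else:
        print(f"Counts match: {db_count}")
    conn.close()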

Environment: SQL, PL/SQL, T-SQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP.
