
Senior Big Data Engineer Resume


Blue Ash, OH

SUMMARY

  • 8+ years of Big Data and Hadoop ecosystem experience in the ingestion, storage, querying, processing, and analysis of big data.
  • Experience with Apache Hadoop components such as HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Spark, and Flume for big data and big data analytics.
  • Experience in writing PL/SQL statements: stored procedures, functions, triggers, and packages.
  • Skilled in Tableau Desktop 10.x for data visualization, reporting, and analysis.
  • Developed reports, dashboards using Tableau for quick reviews to be presented to Business and IT users.
  • Extensive knowledge of Tableau reporting objects such as facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters.
  • Hands-on experience with different ETL tools to shape data so it can be connected to Tableau through Tableau Data Extracts.
  • Hands-on experience installing and configuring Hadoop ecosystem components such as HDFS, MapReduce, YARN, Pig, Hive, HBase, Oozie, Sqoop, Flume, and Kafka.
  • Excellent knowledge of Hadoop architecture, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
  • Expertise in Business Intelligence, Data warehousing technologies, ETL and Big Data technologies.
  • Experience in creating ETL mappings using Informatica to move data from multiple sources, such as flat files and Oracle, into a common target area such as a data warehouse.
  • Expertise in writing complex SQL queries, using indexing, aggregation, and materialized views to optimize query performance.
  • Experience with segmentation analysis, regression models, and clustering.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services in the AWS family.
  • Selecting appropriate AWS services to design and deploy an application based on given requirements.
  • Involved in creating database objects like tables, views, procedures, triggers, and functions using T-SQL to provide definition, structure and to maintain data efficiently.
  • Experience in developing MapReduce Programs using Apache Hadoop for analyzing the big data as per the requirement.
  • Involved in writing data transformations and data cleansing using Pig operations, with good experience retrieving and processing data using Hive.
  • Experience collecting log data and JSON data into HDFS using Flume and processing the data using Hive/Pig.
  • Experience building data pipelines using Azure Data Factory and Azure Databricks, and loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse while controlling and granting database access.
  • Good experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, Storage Explorer.
  • Proficient in data analysis, cleansing, transformation, data migration, data integration, data import, and data export through the use of ETL tools such as Informatica.
  • Analyzed data and provided insights with R programming and Python pandas.
  • Experience in working with SAS Enterprise Guide Software for reporting and analytical tasks.
  • Experience in utilizing SAS procedures, macros, and other SAS applications for data extraction from Oracle and Teradata.
  • Configured Spark Streaming to receive real-time data from Kafka, store the stream data to HDFS, and process it using Spark and Scala (a minimal PySpark sketch of this pattern follows this summary).
  • Worked with HBase to perform quick lookups (updates, inserts, and deletes) in Hadoop.
  • Experience with the Oozie workflow scheduler, managing Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
  • Expertise in working with Linux/Unix and shell commands on the Terminal.
  • Extensively used microservices and Postman to interact with Hadoop clusters.
  • Expertise with Python, Scala, and Java in the design, development, administration, and support of large-scale distributed systems.
  • Expert in building enterprise data warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.
  • Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
  • Experience working with Hortonworks and Cloudera environments.
  • Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying them on public or private clouds.
  • Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
  • Good knowledge in implementing various data processing techniques using Apache HBase for handling the data and formatting it as required.
  • Experience in using build/deploy tools such as Jenkins, Docker, and OpenShift for Continuous Integration and Deployment of microservices.
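The Kafka-to-HDFS streaming pattern referenced above can be sketched roughly as follows. This is a minimal PySpark Structured Streaming illustration (the work itself used Spark with Scala), and the broker, topic, and path names are placeholders rather than values from any actual project.

```python
# Minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS pattern.
# Broker, topic, and path names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

# Read a Kafka topic as a streaming DataFrame (requires the
# spark-sql-kafka package on the classpath).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .load()
    .select(col("value").cast("string").alias("raw_event"))
)

# Persist the raw stream to HDFS as Parquet with checkpointing.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/raw/events")             # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start()
)
query.awaitTermination()
```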

TECHNICAL SKILLS

Big Data Tools: Hadoop ecosystem (HDFS, MapReduce), Spark, Airflow, NiFi, HBase, Hive, Pig, Sqoop, Kafka, Oozie

Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile

Cloud Platform: AWS (Amazon Web Services), Microsoft Azure

Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Programming Languages: SQL, PL/SQL, and UNIX shell scripting.

OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9

Databases: Oracle 12c/11g, Teradata R15/R14.

ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, Blue Ash, OH

Senior Big Data Engineer

Responsibilities:

  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Used SSIS to build automated multi-dimensional cubes.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS and NoSQL databases such as HBase and Cassandra using Python.
  • Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Installed, configured, and maintained data pipelines.
  • Transformed business problems into big data solutions and defined the big data strategy and roadmap.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Designed and implemented multiple ETL solutions with various data sources using extensive SQL scripting, ETL tools, Python, shell scripting, and scheduling tools; performed data profiling and data wrangling of XML, web feeds, and files using Python, Unix, and SQL.
  • Loaded data from different sources into a data warehouse and performed data aggregations for business intelligence using Python.
  • Designed and implemented Sqoop for incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau via HiveServer2 to generate interactive reports.
  • Used Sqoop to channel data from different sources of HDFS and RDBMS.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks (see the sketch after this list).
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
  • Develop solutions to leverage ETL tools and identify opportunities for process improvements using Informatica and Python
  • Created multiple dashboards in Tableau for multiple business needs.
  • Prepared and uploaded SSRS reports; managed database and SSRS permissions.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various AML modules, including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
  • Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
  • Used SQL Server management tools to check the data in the database against the given requirements.
  • Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
  • Created a serverless data ingestion pipeline on AWS using MSK (Kafka) and Lambda functions.
  • Developed Java applications that read data from MSK (Kafka) and write it to DynamoDB.
  • Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers and send it to the Kafka broker.
  • Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MLOAD, BTEQ, and FastLoad.
  • Worked with analysis tools such as Tableau for regression analysis, pie charts, and bar graphs.
  • Implemented Actimize Anti-Money Laundering (AML) system to monitor suspicious transactions and enhance regulatory compliance.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Automated data processing with Oozie to schedule data loading into the Hadoop Distributed File System (HDFS).
  • Developed automated regression scripts for validation of the ETL process between multiple databases such as AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server using Python.
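As a rough illustration of the PySpark UDF work mentioned above, the sketch below shows one hypothetical cleaning/conforming UDF; the column name and normalization rules are invented for the example rather than taken from the project.

```python
# Hypothetical sketch of a custom PySpark UDF for cleaning and conforming tasks.
# Column names and rules are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-cleaning-sketch").getOrCreate()

@udf(returnType=StringType())
def normalize_label(value):
    """Trim whitespace, upper-case, and map empty values to 'UNKNOWN'."""
    if value is None or value.strip() == "":
        return "UNKNOWN"
    return value.strip().upper()

df = spark.createDataFrame(
    [(" gold ",), (None,), ("silver",)], ["customer_tier"]
)

cleaned = df.withColumn("customer_tier", normalize_label(col("customer_tier")))
cleaned.show()
```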

Environment: Cloudera Manager (CDH5), Hadoop, PySpark, HDFS, NiFi, Pig, Hive, S3, Kafka, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, OLTP, OLAP, HBase, Cassandra, SQL Server, Python, Shell Scripting, XML, Unix

Confidential, Chesterfield, MO

Big Data Engineer

Responsibilities:

  • Experienced in installing, configuring and using Hadoop Ecosystem components.
  • Experienced in importing and exporting data into HDFS and Hive using Sqoop.
  • Used Spark DataFrame operations to perform required validations and analytics on Hive data. Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.
  • Developed Python scripts to collect Redshift CloudWatch metrics and automated loading the data points into the Redshift database.
  • Participated in development/implementation of Cloudera Hadoop environment.
  • Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
  • Installing IBM Http Server, WebSphere Plugins and WebSphere Application Server Network Deployment (ND).
  • Involved in implementing and integrating various NoSQL databases such as HBase and Cassandra.
  • Queried and analyzed data from DataStax Cassandra for quick searching, sorting, and grouping.
  • Experienced in working with various data sources such as Teradata and Oracle; successfully loaded files from Teradata to HDFS and from HDFS into Hive and Impala.
  • Designed and implemented a product search service using Apache Solr/Lucene.
  • Developed Map Reduce programs in Java for parsing the raw data and populating staging Tables.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing.
  • Used DataStax Cassandra along with Pentaho for reporting.
  • Experienced in running queries using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
  • Integrated Cassandra as a distributed, persistent metadata store to provide metadata resolution for network entities.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and pre-processing.
  • Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near real-time analysis.
  • Integrated Kafka with Spark Streaming for real time data processing
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL Activity. Built an ETL job that uses a Spark JAR to execute the business analytical model.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Used Spark SQL to load JSON data, create a SchemaRDD, and load it into Hive tables, and handled structured data using Spark SQL (see the sketch after this list).
  • Subscribed to Kafka topics with the Kafka consumer client and processed the events in real time using Spark.
  • Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
  • Used the YARN architecture and MapReduce in the development cluster for a POC.
  • Supported MapReduce programs running on the cluster and was involved in loading data from the UNIX file system into HDFS.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data.
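A minimal sketch of the Spark SQL JSON-to-Hive flow noted above, assuming Hive support is enabled in the SparkSession; the file path, view name, column names, and target table are illustrative placeholders.

```python
# Minimal sketch: load JSON with Spark SQL and persist the result as a Hive table.
# Paths, columns, and table names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("json-to-hive-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Spark infers the schema from the JSON records.
events = spark.read.json("hdfs:///data/landing/events.json")  # placeholder path

# Register a temporary view so structured data can be handled with Spark SQL.
events.createOrReplaceTempView("events_stg")
daily_counts = spark.sql(
    "SELECT event_type, COUNT(*) AS cnt FROM events_stg GROUP BY event_type"
)

# Persist the aggregated result into a Hive table.
daily_counts.write.mode("overwrite").saveAsTable("analytics.event_counts")
```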

Environment: CDH, MapReduce, HDFS, Hive, Kafka, Pig, Impala, YARN, Azure, Databricks, Data Lake, HDInsight, Data Factory, HBase, Cassandra, Spark, Solr, Java, SQL, Tableau, Zookeeper, Sqoop, Teradata, CentOS, Pentaho, WebSphere, MySQL, Python, Shell, Git

Confidential, Springfield, MA

Big Data Engineer

Responsibilities:

  • Plan, design, and implement application database code objects such as stored procedures and views
  • Performed data manipulation on extracted data using Python Pandas.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
  • Used Hive Queries in Spark-SQL for analysis and processing the data.
  • Implemented Hive UDFs and performed tuning for better results.
  • Developed and tuned SQL on HiveQL, Drill, and Spark SQL.
  • Experience in using Sqoop to import and export the data from Oracle DB into HDFS and HIVE
  • Developed Spark code using Spark RDD and Spark-SQL/Streaming for faster processing of data
  • Implemented partitioning, data modelling, dynamic partitions, and buckets in Hive for efficient data access (see the sketch after this list).
  • Wrote T-SQL objects such as indexes, views, stored procedures, and triggers.
  • Involved in database migrations from legacy systems and SQL Server to Oracle.
  • Involved in database testing, writing complex SQL queries to verify the transactions and business logic like identifying the duplicate rows by using SQL Developer and PL/SQL Developer
  • Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
  • Wrote complex SQL scripts to avoid Informatica Look-ups to improve the performance as the volume of the data was heavy.
  • Responsible for the design, development, and data modelling of Spark SQL scripts based on functional specifications.
  • Work with subject matter experts and project team to identify, define, collate, document and communicate the data migration requirements.
  • Hands-on Ab Initio ETL, data mapping, transformation, and loading in a complex, high-volume environment.
  • Provided database coding to support business applications using T-SQL
  • Performed quality assurance and testing of SQL server environment
  • Developed new processes to facilitate import and normalization, including data files
  • Interacted with report owners to establish/clarify their requirements and developed report specifications
  • Developed SQL queries and Python and R programs to fetch complex data from different tables in remote databases using joins, database links, and bulk collects.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
  • Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework and handled Json Data
  • Designed and developed extract, transform, and load (ETL) mappings, procedures, and schedules, following the standard development lifecycle
  • Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms
  • Worked in writing Spark SQL scripts for optimizing the query performance
  • Well-informed on Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.
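The Hive partitioning and bucketing mentioned above can be illustrated from PySpark roughly as follows; the database, table, and column names are assumptions for the sketch, not actual project objects.

```python
# Illustrative sketch of Hive-style dynamic partitioning and bucketing from PySpark.
# Database, table, and column names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-partitioning-sketch")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate()
)

orders = spark.table("staging.orders")  # placeholder source table

# Partition by a low-cardinality column and bucket by a join/filter key so
# queries that prune on order_date and cluster on customer_id scan less data.
(
    orders.write.mode("overwrite")
    .partitionBy("order_date")
    .bucketBy(8, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("analytics.orders_partitioned")
)
```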

Environment: Spark SQL, Hive, Hadoop YARN, AWS, MapReduce, HiveQL, Sqoop, SQL Server, NoSQL databases, Python, Scala, Shell, Bash Scripting, Git.

Confidential

Data Engineer

Responsibilities:

  • Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
  • Involved in creating UNIX shell scripts; performed table defragmentation, partitioning, compression, and indexing for improved performance and efficiency.
  • Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
  • Used SQL Server Integration Services (SSIS) for extracting, transforming, and loading data into the target system from multiple sources.
  • Developed and implemented an R Shiny application that showcases machine learning for business forecasting. Developed predictive models using Python and R for customer churn prediction and customer classification (see the sketch after this list).
  • Built the Oozie pipeline, which performs several actions such as moving files, Sqooping data from the source Teradata or SQL systems into Hive staging tables, performing aggregations per business requirements, and loading the results into the main tables.
  • Created the logical data model from the conceptual model and converted it into the physical database design using Erwin. Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
  • Connected to AWS Redshift through Tableau to extract live data for real time analysis.
  • Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP
  • Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
  • Ran Apache Hadoop, CDH, and MapR distributions as Elastic MapReduce (EMR) on EC2.
  • Performed forking wherever there was scope for parallel processing to reduce data latency.
  • Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
  • Wrote a Pig script that picks data from one HDFS path, performs aggregation, and loads the result into another path, which later populates another domain table; converted this script into a JAR and passed it as a parameter in an Oozie script.
  • Performed data analysis using regression, data cleaning, Excel VLOOKUP, histograms, and the TOAD client; presented the analysis and suggested solutions for investors.
  • Rapidly created models in Python using pandas, NumPy, scikit-learn, and Plotly for data visualization; these models were then implemented in SAS, where they were interfaced with MSSQL databases and scheduled to update on a regular basis.
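A hedged sketch of the kind of Python churn classifier described above, using pandas and scikit-learn; the CSV file, feature columns, target column, and model choice are illustrative assumptions rather than details from the actual engagement.

```python
# Illustrative customer-churn classifier sketch with pandas and scikit-learn.
# Dataset path, feature names, and target column are assumed placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("customer_churn.csv")  # placeholder dataset

features = ["tenure_months", "monthly_charges", "support_tickets"]  # assumed columns
X = df[features]
y = df["churned"]  # assumed binary target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Report precision/recall on the held-out set.
print(classification_report(y_test, model.predict(X_test)))
```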

Environment: MapReduce, Spark, Hive, Pig, Sqoop, AWS, HBase, XML, PL/SQL, SQL, HDFS, Unix, Python, SAS, PySpark, Redshift, Oozie, Impala, Kafka, JSON, Shell Scripting.

Confidential

Hadoop Developer

Responsibilities:

  • Implemented Partitioning, Dynamic Partitions and Buckets in HIVE for efficient data access.
  • Created/modified shell scripts for scheduling various data cleansing scripts and ETL load processes.
  • Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop back to the RDBMS servers after aggregations for other ETL operations (a PySpark JDBC sketch of this import pattern follows this list).
  • Experience in designing and developing applications in PySpark using Python to compare the performance of Spark with Hive.
  • Supported revenue management using statistical and quantitative analysis, developed several statistical approaches and optimization models.
  • Created performance dashboards in Tableau/Excel/PowerPoint for the key stakeholders.
  • Incorporated predictive modeling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations and integrated with the Tableau viz.
  • Developed testing scripts in Python, prepared test procedures, analyzed test result data, and suggested improvements to the system and software.
  • Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Pig, and Hive.
  • Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
  • Collecting and aggregating large amounts of log data and staging data in HDFS for further analysis.
  • Experience in managing and reviewing Hadoop Log files.
  • Used Sqoop to transfer data between relational databases and Hadoop.
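As a rough PySpark JDBC sketch of the RDBMS-to-HDFS import pattern described above (the production jobs used Sqoop, not this code), the snippet below reads a table over JDBC in parallel and lands it in HDFS; connection details, table, and column names are placeholders.

```python
# Illustrative PySpark JDBC sketch of an RDBMS-to-HDFS import.
# The real jobs used Sqoop; URL, credentials, table, and columns are placeholders,
# and the MySQL JDBC driver must be on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-import-sketch").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/sales")   # placeholder URL
    .option("dbtable", "orders")                         # placeholder table
    .option("user", "etl_user")
    .option("password", "***")
    .option("numPartitions", 4)                          # parallel read, like Sqoop mappers
    .option("partitionColumn", "order_id")
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .load()
)

# Land the extracted rows in HDFS as Parquet for downstream Hive/ETL steps.
orders.write.mode("overwrite").parquet("hdfs:///data/raw/orders")
```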

Environment: Spark, Java, Python, Jenkins, HDFS, Sqoop, Hadoop, JSON, Hive, Oozie, Git
