
Senior Big Data Engineer Resume


Charlotte, NC

SUMMARY

  • 9+ years of IT industry experience in all phases of the Software Development Life Cycle (SDLC), with skills in Hadoop development, Big Data/Data Engineering, and the design, development, testing and deployment of software systems.
  • Strong experience across the SDLC, including requirements analysis, design specification and testing, in both Waterfall and Agile methodologies.
  • Expertise in Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
  • Solid experience in using Hadoop ecosystem components like Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, Zookeeper, Kafka, Flume, the MapReduce framework, YARN, Scala and Hue.
  • Strong knowledge of distributed systems architecture and parallel processing; in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
  • Strong background in data modeling using tools such as Erwin, ER Studio and Power Designer.
  • Extensively used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data, and used DataFrame operations to perform required validations on the data.
  • Good knowledge of the architecture and components of Spark; efficient in working with Spark Core, Spark SQL and Spark Streaming, with expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing and stream processing.
  • Designed UNIX shell scripts for automating deployments and other routine tasks.
  • Experience working with Amazon AWS services like EC2, EMR, S3, KMS, Kinesis, Lambda, API Gateway, IAM, etc.
  • Built complex HiveQL queries for required data extraction from Hive tables and wrote Hive User Defined Functions (UDFs) as required.
  • Strong knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
  • Experience in configuring Zookeeper to coordinate servers in clusters and to maintain the data consistency that downstream decision making depends on.
  • Extensive knowledge of working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH and Data Storage Explorer).
  • Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
  • Ingested data into the Snowflake cloud data warehouse using Snowpipe.
  • Good knowledge of technologies for systems that run massive amounts of data in highly distributed mode on Cloudera and Hortonworks Hadoop distributions and Amazon AWS.
  • Extensive experience working with micro-batching to ingest millions of files into the Snowflake cloud as they arrive in the staging area.
  • Knowledge of job workflow scheduling and locking tools/services like Oozie, Zookeeper, Airflow and Apache NiFi.
  • Experience in importing and exporting data using Sqoop from HDFS to relational database systems and from relational database systems to HDFS.
  • Proficient in relational databases like Oracle, MySQL and SQL Server. Extensive experience working with NoSQL databases such as DynamoDB, Cosmos DB, MongoDB, Cassandra and HBase, and their integration.
  • Hands-on experience in using visualization tools like Tableau and Power BI.
  • Experience in configuring Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS (see the sketch after this list), and expertise in using Spark SQL with various data sources like JSON, Parquet and Hive.
  • Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets.
  • Knowledge of Integrated Development Environments like Eclipse, NetBeans, IntelliJ and STS.
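
A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS pattern described above; the broker address, topic name, event schema and output paths are illustrative placeholders rather than details from any specific engagement, and the job assumes the spark-sql-kafka connector package is available.

    # Sketch: consume JSON events from Kafka and land them on HDFS as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
           .option("subscribe", "events")                      # placeholder topic
           .load())

    # Kafka delivers key/value as binary; cast the value and parse the JSON payload.
    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*"))

    # Append each micro-batch to HDFS as Parquet, with a checkpoint for recovery.
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events")                  # placeholder path
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())
    query.awaitTermination()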

TECHNICAL SKILLS

Big Data Tools: Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper

Languages: PL/SQL, SQL, Scala, Python, PySpark, Java, C, C++, Shell script, Perl script

BI Tools: SSIS, SSRS, SSAS.

Modeling Tools: IBM Infosphere, SQL Power Architect, Oracle Designer, Erwin 9.6/9.5, ER/Studio 9.7, Sybase Power Designer.

Cloud Technologies: AWS and Azure

Database Tools: Oracle, MySQL, Microsoft SQL Server, Teradata, MongoDB, Cassandra, HBase

ETL Tools: Pentaho, Informatica Power 9.6, SAP Business Objects XIR3.1/XIR2, Web Intelligence.

Reporting Tools: Business Objects, Crystal Reports.

Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant.

Operating System: Windows, DOS, Unix, Linux.

Other Tools: TOAD, SQL Plus, SQL Loader, MS Project, MS Visio and MS Office; also worked with C++, UNIX, PL/SQL, etc.

PROFESSIONAL EXPERIENCE

Confidential, ADA, MI

Senior Big Data Engineer

Responsibilities:

  • Worked in a fast-paced agile development environment to quickly analyze, develop and test potential use cases for the business.
  • Developed near real-time data pipelines using Spark.
  • Used Python & SAS to extract, transform and load source data from transaction systems and to generate reports, insights and key conclusions.
  • Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to cleanse, transform and serialize raw data.
  • Enabled monitoring and Azure Log Analytics to alert the support team on usage and statistics of the daily runs.
  • Developed various automated scripts for data ingestion (DI) and data loading (DL) using Python MapReduce.
  • Imported documents into HDFS and HBase and created HAR files.
  • Utilized Apache Spark with Python to develop and execute Big Data Analytics.
  • Hands-on coding: wrote and tested code for the ingest automation process (full and incremental loads); designed the solution and developed the data ingestion program using Sqoop, MapReduce, shell script and Python.
  • Created and maintained optimal data pipeline architecture in the Microsoft Azure cloud using Data Factory and Azure Databricks.
  • Developed application to refresh Power BI reports using automated trigger API
  • Import, clean, filter and analyze data using tools such as SQL, HIVE and PIG.
  • Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
  • Created various Parser programs to extract data from Autosys, Tibco Business Objects, XML, Informatica, Java, and database views using Scala.
  • Integrated and automated data workloads to Snowflake Warehouse.
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Implemented a machine learning back-end pipeline with Pandas and NumPy.
  • Worked in creating POCs for multiple business user stories using Hadoop ecosystem.
  • Developed Spark programs used by the application for faster data processing than standard MapReduce programs.
  • Day-to-day responsibilities included developing ETL pipelines into and out of the data warehouse and building major regulatory and financial reports using advanced SQL queries in Snowflake.
  • Worked on developing Oozie jobs to create HAR files
  • Developed a near real-time data pipeline using Flume, Kafka and Spark Streaming to ingest client data from their web log servers and apply transformations.
  • Responsible for analyzing large data sets and deriving customer usage patterns by developing new MapReduce programs in Java.
  • Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
  • Responsible for designing and developing data ingestion from Kroger using Apache NiFi/Kafka.
  • Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster processing of data.
  • Took proof-of-concept project ideas from the business and led, developed and created production pipelines that deliver business value using Azure Data Factory.
  • Exposed transformed data in the Azure Databricks Spark platform in Parquet format for efficient data storage.
  • Designed and implemented large-scale pub-sub message queues using Apache Kafka.
  • Implemented Apache Airflow for authoring, scheduling and monitoring data pipelines (see the DAG sketch after this list).
  • Worked on UDFs using Python for data cleansing.
  • Ingested data to one or more Azure cloud services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and handled cloud migration, processing the data in Azure Databricks.
  • Worked with hundreds of terabytes of data collected from different loan applications into HDFS.
  • Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark.
  • Extracted, transformed and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics); ingested data to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Wrote an Impala metastore sync-up in Scala for two separate clusters sharing the same metadata.
  • Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, which allowed end users to understand the data on the fly using quick filters for on-demand information.
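
A minimal Apache Airflow sketch of the authoring-and-scheduling pattern referenced above; the DAG id, owner, schedule and task callables are hypothetical placeholders rather than the production pipeline.

    # Sketch: a daily three-step pipeline authored as an Airflow DAG (Airflow 2.x API).
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract(**context):
        print("extracting for", context["ds"])    # placeholder extract logic

    def transform(**context):
        print("transforming for", context["ds"])  # placeholder transform logic

    def load(**context):
        print("loading for", context["ds"])       # placeholder load logic

    default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

    with DAG(
        dag_id="daily_ingest_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # Task dependencies; scheduling and monitoring then come from the Airflow UI.
        t_extract >> t_transform >> t_load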

Environment: Agile, Power BI, Azure, Azure Databricks, Azure Data Factory, Azure Data Lake, Hadoop, Hortonworks, Snowflake, HDFS, Solr, HAR, HBase, Oozie, Scala, Python, SOAP API web services, Java, WebLogic, Tableau, Apache Airflow, Jira

Confidential, Charlotte, NC

Big Data Engineer

Responsibilities:

  • Used Agile methodology in developing the application, which included iterative application development, weekly Sprints, stand up meetings and customer reporting backlogs.
  • Understanding of the AWS product and service suite, primarily EC2, S3, VPC, Lambda, Redshift and Redshift Spectrum.
  • Created and managed cloud VMs with AWS EC2 Command line clients and AWS management console.
  • Migrated on-premises database structures to the Confidential Redshift data warehouse; worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Involved in data migration to Snowflake using AWS S3 buckets.
  • Wrote MapReduce jobs in Python for data cleaning and data processing.
  • Extracted batch and real-time data from DB2, Oracle, SQL Server, Teradata and Netezza to Hadoop (HDFS) using Teradata TPT, Sqoop, Apache Kafka and Apache Storm.
  • Developed a Python script to load CSV files into S3 buckets (see the sketch after this list); created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
  • Designed and built ETL workflows, leading the effort to program data extraction from various sources into the Hadoop file system; implemented end-to-end ETL workflows using Teradata, SQL, TPT and Sqoop and loaded into Hive data stores.
  • Designed the incremental and historical extract logic to load data from flat files on various servers into the Massive Event Logging Database (MELD).
  • Handled importing of data from various data sources, performed transformations using Hive, MapReduce, Spark and loaded data into HDFS.
  • Monitored resources and applications using AWS CloudWatch, including creating alarms to monitor metrics for EBS, EC2, ELB, RDS, S3 and SNS, and configured notifications for the alarms generated based on defined events.
  • Used PySpark to expose Spark API to Python.
  • Used AWS S3 buckets to store files, ingested the files into Snowflake tables using Snowpipe, and ran deltas using data pipelines.
  • Developed Talend jobs to populate the claims data into the data warehouse (star schema, snowflake schema, hybrid schema).
  • Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping and aggregation and how they translate to MapReduce jobs.
  • Built various visualizations and reports in Tableau using Snowflake data.
  • Loaded, analyzed and extracted data to and from Elasticsearch with Python.
  • Assisted with the analysis of data used for the Tableau reports and the creation of dashboards.
  • Designed and implemented large-scale distributed solutions in AWS.
  • Analyzed and developed programs around the extract logic and data load type for Hadoop ingest processes, using relevant tools such as Sqoop, Spark, Scala, Kafka, Unix shell scripts and others.
  • Wrote code and created Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing; created Lambda jobs and configured roles using the AWS CLI.
  • Handled change implementation, monitoring and troubleshooting of AWS Snowflake databases and cluster-related issues.
  • Developing Apache Spark jobs for data cleansing and pre-processing.
  • Developed UDFs in Java as and when necessary to use in PIG and HIVE queries.
  • Automated cloud deployments using Chef, Python and AWS CloudFormation templates.
  • Optimized Map Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Used ORC and Parquet file formats in Hive.
  • Developed efficient Pig and Hive scripts with joins on datasets using various techniques.
  • Wrote documentation of program development, subsequent revisions and coded instructions in the project-related GitHub repository.
  • Wrote Spark programs to improve the performance and optimization of existing algorithms in Hadoop using the Spark context, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
  • Used Scala to write programs for faster testing and processing of data.
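
A minimal boto3 sketch of the CSV-to-S3 load script mentioned above; the bucket name, local directory and key prefix are assumptions for illustration only.

    # Sketch: upload local CSV extracts into an S3 bucket under a dated prefix.
    import os
    from datetime import date
    import boto3

    s3 = boto3.client("s3")
    bucket = "example-landing-bucket"            # placeholder bucket
    local_dir = "/data/exports"                  # placeholder local directory
    prefix = f"loads/{date.today():%Y/%m/%d}"    # dated "folder" per run

    for name in os.listdir(local_dir):
        if name.endswith(".csv"):
            key = f"{prefix}/{name}"
            s3.upload_file(os.path.join(local_dir, name), bucket, key)
            print(f"uploaded s3://{bucket}/{key}")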

Environment: RHEL, HDFS, Python, Django, Flask, PySpark, MapReduce, Hive, Snowflake, AWS, EC2, S3, Lambda, Redshift, Pig, Sqoop, Oozie, Teradata, Oracle SQL, UC4, Kafka, GitHub, Hortonworks Data Platform distribution, Spark, Scala.

Confidential, Menlo Park, CA

Data Engineer

Responsibilities:

  • Defined transformation and cleansing rules for data management involving OLTP and OLAP.
  • Built the Oozie pipeline, which performs several actions such as moving files, Sqooping data from the source Teradata or SQL systems, exporting it into Hive staging tables, performing aggregations per business requirements, and loading into the main tables.
  • Implemented reporting in PySpark, Zeppelin, Jupyter, & querying using Airpal, Presto & AWS Athena.
  • Developed and implemented an R and Shiny application that showcases machine learning for business forecasting. Developed predictive models using Python & R for customer churn prediction and customer classification (see the sketch after this list).
  • Involved in creating UNIX shell scripts; performed table defragmentation, partitioning, compression and indexing for improved performance and efficiency.
  • Built an AWS CI/CD data pipeline and AWS data lake using EC2, AWS Glue and AWS Lambda.
  • Developed reusable objects such as PL/SQL program units and libraries, database procedures and functions, and database triggers to be used by the team and to satisfy the business rules.
  • Used SQL Server Integration Services (SSIS) for extracting, transforming and loading data into the target system from multiple sources.
  • Ran Apache Hadoop, CDH and MapR distros on Elastic MapReduce (EMR) on EC2.
  • Performed forking whenever there was scope for parallel processing to optimize data latency.
  • Worked on different data formats such as JSON and XML and applied machine learning algorithms in Python.
  • Built a Pig script that picks up data from one HDFS path, performs aggregations and loads the results into another path, which later populates another domain table; converted this script into a jar and passed it as a parameter in the Oozie script.
  • Developed data mappings.
  • Created the logical data model from the conceptual model and converted it into the physical database design using Erwin. Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
  • Rapidly created models in Python using pandas, NumPy, sklearn and Plotly for data visualization; these models were then implemented in SAS, where they interface with MSSQL databases and are scheduled to update on a regular basis.
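
A minimal scikit-learn sketch of the churn-classification work mentioned above; the input file, feature columns, target column and model choice are illustrative assumptions, not the original models.

    # Sketch: train and evaluate a simple churn classifier in Python.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    df = pd.read_csv("customers.csv")                                 # placeholder input
    features = ["tenure_months", "monthly_spend", "support_tickets"]  # assumed columns
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["churned"], test_size=0.2, random_state=42
    )

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))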

Environment: MapReduce, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, AWS, Kafka, JSON, XML, PL/SQL, SQL, HDFS, Unix, Python, PySpark

Confidential

Hadoop Developer

Responsibilities:

  • Developed a Python script to transfer data from on-premises systems to AWS S3.
  • Used Hive to implement a data warehouse and stored data in HDFS; stored data in Hadoop clusters set up in AWS EMR.
  • Developed and maintained batch data flows using HiveQL and Unix scripting.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python (see the sketch after this list).
  • Handled large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, transformations and other optimizations during the ingestion process itself.
  • Worked extensively with Sqoop for importing metadata from Oracle.
  • Experience in writing stored procedures and complex SQL queries using relational databases like Oracle, SQL Server and MySQL.
  • Managed data imported from different data sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS.
  • Involved in designing the HBase row key to store text and JSON as key values in the HBase table, designing the row key so that it can be fetched and scanned in sorted order.
  • Recommended improvements and modifications to existing data and ETL pipelines.
  • Performed Data Preparation by using Pig Latin to get the right data format needed.
  • Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats like text and CSV files.
  • Created Hive schemas using performance techniques like partitioning and bucketing.
  • Used Hadoop YARN to perform analytics on data in Hive.
  • Involved in creating Hive tables and loading and analyzing data using Hive queries.
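
A minimal PySpark sketch of converting a Hive query into equivalent DataFrame transformations, as referenced above; the database, table and column names are placeholders.

    # Sketch: the same aggregation expressed as HiveQL and as DataFrame transformations.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("hive-to-df").enableHiveSupport().getOrCreate()

    # HiveQL version, run through Spark SQL.
    hive_df = spark.sql("""
        SELECT customer_id, SUM(amount) AS total_amount
        FROM sales.transactions
        WHERE txn_date >= '2020-01-01'
        GROUP BY customer_id
    """)

    # Equivalent DataFrame transformations.
    df = (spark.table("sales.transactions")
          .filter(F.col("txn_date") >= "2020-01-01")
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_amount")))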

Environment: Hadoop, MapReduce, AWS, HBase, JSON, Spark, Kafka, Hive, Pig, Hadoop YARN, Spark Core, Spark SQL, Scala, Python, Java, Sqoop, Impala, Oracle, Linux, Oozie.

Confidential

Data Analyst

Responsibilities:

  • Gathered requirements from Business and documented for project development.
  • Involved in understanding the legacy applications & data relationships.
  • Prepared and maintained documentation for on-going projects.
  • Worked with Informatica Power Center for data processing and loading files.
  • Extensively worked with Informatica transformations.
  • Developed mappings using Informatica to load data from sources such as Relational tables, Sequential files into the target system.
  • Attended user design sessions, studied user requirements, completed detail design analysis and wrote design specs.
  • Created data maps in Informatica to extract data from Sequential files.
  • Performed Unit, Integration and System testing of various jobs.
  • Coordinated design reviews, ETL code reviews with teammates.
  • Interacted with key users and assisted them with various data issues, understood data needs and assisted them with Data analysis.
  • Extensively worked on UNIX Shell Scripting for file transfer and error logging.

Environment: Informatica Power Center, Oracle 10g, SQL Server, UNIX Shell Scripting
