Sr Big Data Engineer Resume

Woonsocket, RI

SUMMARY

  • Over 8 years of IT experience in analysis, design, development, implementation, maintenance, and support, with experience in developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
  • Strong experience in Hadoop distributed file system (HDFS), Impala, Sqoop, Kafka, Hive, Spark, Hue, MapReduce framework, Oozie, Zookeeper and Pig.
  • Installed packages and set up a CDH cluster, coordinating Zookeeper, Spark, Kafka, and HDFS.
  • Experience in data analysis using Hive, Pig Latin, HBase, and custom MapReduce programs in Java.
  • Experience in writing custom UDFs in Java and Scala to extend Hive and Pig functionality.
  • Responsible for importing data into HDFS using Sqoop from different RDBMS servers and exporting aggregated data back to RDBMS servers with Sqoop for further ETL operations.
  • Experience with Cloudera and Hortonworks distributions.
  • Experience in importing and exporting data using Sqoop between HDFS and Relational Database Systems (RDBMS).
  • Experience in migrating other databases to Snowflake.
  • Well versed with Big Data on AWS cloud services such as EC2, S3, Glue, Athena, DynamoDB, and Redshift.
  • Used Spark Streaming to collect data from Kafka in near real time and perform the necessary transformations and aggregations on the fly to build the common learner data model, persisting the data in a Cassandra cluster (a minimal sketch follows this list).
  • Experience in creating Power BI reports and upgrading Power Pivot reports to Power BI.
  • Defined virtual warehouse sizing in Snowflake for different types of workloads.
  • Solid experience with Azure Data Lake Analytics, Azure Databricks, Azure Data Lake Storage, Azure Data Factory, Azure SQL databases, and Azure SQL Data Warehouse for providing analytics and reports to improve marketing strategies.
  • Experienced in processing large datasets with Spark using Python.
  • Implemented a Hadoop backup strategy to back up Hive, HDFS, HBase, Oozie, etc.
  • Wrote Oozie workflows to invoke jobs at predefined intervals.
  • Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying on public or private clouds.
  • Worked on a POC with Kafka and NiFi to pull real-time events into Hadoop.
  • Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and YARN.
  • Experienced in managing Hadoop clusters using Hortonworks Ambari.
  • Experience in working with Flume to load the log data from multiple sources directly into HDFS.
  • Read data from HBase into Spark to perform joins across different tables.
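
The Spark Streaming bullet above describes reading events from Kafka, aggregating them on the fly, and persisting the result to Cassandra. Below is a minimal PySpark Structured Streaming sketch of that pattern; the topic, schema, keyspace, and table names are illustrative assumptions, and the Spark-Cassandra connector is assumed to be on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("learner-model-stream").getOrCreate()

    # Hypothetical event schema for the learner data model.
    schema = StructType([
        StructField("learner_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Consume events from Kafka in near real time.
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "learner_events")
           .load())

    # Parse the JSON payload and aggregate per learner in 5-minute windows.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("e"))
              .select("e.*"))
    model = (events.withWatermark("event_ts", "10 minutes")
             .groupBy("learner_id", F.window("event_ts", "5 minutes"))
             .agg(F.count("*").alias("event_count"))
             .select("learner_id",
                     F.col("window.start").alias("window_start"),
                     "event_count"))

    # Persist each micro-batch to Cassandra through the connector.
    def write_to_cassandra(batch_df, batch_id):
        (batch_df.write.format("org.apache.spark.sql.cassandra")
         .mode("append")
         .options(keyspace="learner", table="learner_model")
         .save())

    (model.writeStream
     .outputMode("update")
     .foreachBatch(write_to_cassandra)
     .start()
     .awaitTermination())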

TECHNICAL SKILLS

Languages: Scala, Python, SQL, Java, PL/SQL, Linux shell scripts

Hadoop Ecosystem: HDFS, Spark/PySpark, YARN, Hive, HBase, Impala, Zookeeper, Sqoop, Oozie, Drill, Flume, Solr, Avro, AWS, Amazon EC2, S3, Azure Databricks

Enterprise Technologies: J2EE, Event Hub, Kinesis, JDBC, Kafka

Operating Systems: Windows, Linux, UNIX, macOS.

IDEs: Eclipse, IntelliJ

Cloud Environment: AWS, Azure, Snowflake

Relational Databases: Oracle, SQL Server, DB2, MySQL, Teradata

NoSQL databases: HBase, MongoDB, Cassandra

Markup Languages: HTML, XHTML, XML, DHTML.

Build & Management Tools: ANT, MAVEN, SVN.

Query Languages: SQL, PL/SQL.

Methodologies: SDLC, OOAD, Agile.

Continuous Integration Tools: Jenkins, Docker, Kubernetes

PROFESSIONAL EXPERIENCE

Confidential, Woonsocket, RI

Sr Big Data Engineer

Responsibilities:

  • Involved in the complete big data flow of the application, from upstream data ingestion into HDFS through processing and analyzing the data in HDFS.
  • Designed and implemented Kafka topics in the new Kafka cluster across all environments.
  • Demonstrated expert-level technical capabilities in Azure batch and interactive solutions, Azure Machine Learning solutions, and operationalizing end-to-end Azure cloud analytics solutions.
  • Designed end-to-end scalable architectures to solve business problems using various Azure components such as HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
  • Designed changes to transform current Hadoop jobs to HBase.
  • Partnered with ETL developers to ensure that data was well cleaned and the data warehouse stayed up to date for reporting purposes, using Pig.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the Cosmos activity.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and a write-back tool, in both directions.
  • Configured Flume to extract the data from the web server output files to load into HDFS.
  • Used Flume to collect, aggregate and store the web log data from different sources like web servers, mobile and network devices and pushed into HDFS.
  • Develop solutions to leverage ETL tools and identify opportunities for process improvements using Informatica and Python
  • Used Windows Azure SQL Reporting Services to create reports with tables, charts, and maps.
  • Wrote Flume configuration files for importing streaming log data into HBase with Flume.
  • Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
  • Used Flume with a spooling directory source to load data from the local file system (LFS) into HDFS.
  • Designed HBase row keys to store text and JSON as values, structuring the keys so that gets and scans return data in sorted order.
  • Created partitioned Hive tables and worked on them using HiveQL (see the sketch after this list).
  • Responsible for data services and data movement infrastructures
  • Experienced in ETL concepts, building ETL solutions and Data modeling
  • Worked on architecting the ETL transformation layers and writing spark jobs to do the processing.
  • Experience in developing MapReduce Programs using Apache Hadoop for analyzing the big data as per the requirement.
  • Implemented Kafka security features using SSL, initially without Kerberos; later set up Kerberos with users and groups for finer-grained security, enabling more advanced security features.
  • Measured the efficiency of the Hadoop/Hive environment, ensuring SLAs were met.
  • Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
  • Used Oozie operational services for batch processing and scheduling workflows dynamically.
  • Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
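
As a hedged illustration of the partitioned Hive tables mentioned above, the PySpark sketch below creates a partitioned table and runs HiveQL against a single partition; the database, table, and column names are assumptions, not taken from the project.

    from pyspark.sql import SparkSession

    # Hive support lets Spark create and query tables in the Hive metastore.
    spark = (SparkSession.builder
             .appName("hive-partitioned-tables")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical table, stored as Parquet and partitioned by date.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.web_events (
            user_id STRING,
            url     STRING,
            status  INT
        )
        PARTITIONED BY (event_date STRING)
        STORED AS PARQUET
    """)

    # Dynamic-partition insert from a hypothetical staging table.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT INTO TABLE analytics.web_events PARTITION (event_date)
        SELECT user_id, url, status, event_date
        FROM analytics.web_events_stage
    """)

    # Partition pruning: only the 2020-01-01 partition is scanned.
    spark.sql("""
        SELECT status, COUNT(*) AS hits
        FROM analytics.web_events
        WHERE event_date = '2020-01-01'
        GROUP BY status
    """).show()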

Environment: Hadoop YARN, Spark, Spark Streaming, MapReduce, Spark SQL, Kafka, Scala, Azure, Python, Hive, Sqoop, Impala, Tableau, Talend, Oozie, Control-M, HBase, Java, Oracle 12c, Linux

Confidential, Dallas, TX

Big Data Engineer

Responsibilities:

  • Used Python in Spark to extract data from Snowflake and upload it to Salesforce on a daily basis.
  • Wrote an event-based Python service using AWS Lambda to deliver real-time data to One-Lake (a data lake solution in the Cap-One enterprise); a minimal handler sketch follows this list.
  • Analyzed SQL scripts and designed solutions to implement them using PySpark.
  • Exported tables from Teradata to HDFS using Sqoop and built tables in Hive.
  • Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, handling structured data with Spark SQL.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Developed Spark programs with Python and applied functional programming principles to process complex structured data sets.
  • Developed automated regression scripts for validating ETL processes across multiple databases such as AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server using Python.
  • Used Airflow for scheduling the Hive, Spark and MapReduce jobs.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.
  • Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to AWS.
  • Worked with Hadoop ecosystem and Implemented Spark using Scala and utilized Data frames and Spark SQL API for faster processing of data.
  • Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications, and executed machine learning use cases with Spark ML and MLlib.
  • Developed Spark Streaming job to consume the data from the Kafka topic of different source systems and push the data into HDFS locations.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
  • Filtered and cleaned data using Scala code and SQL queries.
  • Troubleshot errors in the HBase shell/API, Pig, Hive, and MapReduce.
  • Installed and configured a multi-node cluster in the cloud using Amazon Web Services (AWS) on EC2.
  • Used Talend for Big Data Integration using Spark and Hadoop.
  • Responsible for analyzing large data sets and derive customer usage patterns by developing new MapReduce programs using Java.
  • Designed a Kafka producer client using Confluent Kafka and produced events into Kafka topics (see the producer/consumer sketch after this list).
  • Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
  • Subscribed to Kafka topics with a Kafka consumer client and processed the events in real time using Spark.
  • Collected data with Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Generated metadata and created Talend ETL jobs and mappings to load the data warehouse and data lake.
  • Designed and Developed Real Time Stream Processing Application using Spark, Kafka, Scala and Hive to perform Streaming ETL and apply Machine Learning.
  • Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, and Spark Streaming with a broad variety of machine learning methods, including classification, regression, and dimensionality reduction.
  • Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLTP reporting.
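
For the event-based Lambda service mentioned at the top of this section, a minimal Python handler sketch is shown below. The bucket name, key layout, and event shape are hypothetical; the real service's One-Lake integration is not reproduced here.

    import json
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    LAKE_BUCKET = "one-lake-raw-landing"  # hypothetical landing bucket

    def handler(event, context):
        """Triggered per upstream event; writes the raw payload to the data lake."""
        now = datetime.now(timezone.utc)
        key = f"salesforce/dt={now:%Y-%m-%d}/{context.aws_request_id}.json"
        s3.put_object(
            Bucket=LAKE_BUCKET,
            Key=key,
            Body=json.dumps(event).encode("utf-8"),
            ContentType="application/json",
        )
        return {"status": "ok", "key": key}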
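
The Kafka bullets above describe producing events with a Confluent client and consuming them for real-time processing. A hedged sketch of that produce/consume loop using the confluent-kafka Python client follows; the broker address, topic name, and group id are placeholders.

    import json

    from confluent_kafka import Consumer, Producer

    BROKERS = "broker1:9092"          # placeholder bootstrap servers
    TOPIC = "source-system-events"    # placeholder topic

    # Produce an event into the topic.
    producer = Producer({"bootstrap.servers": BROKERS})
    producer.produce(TOPIC, key="order-1001", value=json.dumps({"amount": 42.5}))
    producer.flush()

    # Consume events from the same topic; downstream this would feed Spark.
    consumer = Consumer({
        "bootstrap.servers": BROKERS,
        "group.id": "streaming-etl",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([TOPIC])
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            event = json.loads(msg.value())
            print(event)  # hand off to real-time processing here
    finally:
        consumer.close()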

Environment: Hadoop, Spark, Scala, HBase, Hive, Python, PL/SQL, AWS, EC2, S3, Lambda, Auto Scaling, CloudWatch, CloudFormation, IBM InfoSphere DataStage, MapReduce, Oracle 12c, Flat files, TOAD, MS SQL Server database, XML files, Cassandra, MongoDB, Kafka, MS Access database, Autosys, UNIX, Erwin.

Confidential, Charlotte, NC

Data Engineer

Responsibilities:

  • Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
  • Implementing and Managing ETL solutions and automating operational processes.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Measured the efficiency of the Hadoop/Hive environment, ensuring SLAs were met.
  • Optimized the TensorFlow model for efficiency.
  • Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
  • Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and the Database Engine Tuning Advisor to enhance performance.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Migrated the on-premises database structure to the Confidential Redshift data warehouse.
  • Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin.
  • Defined and deployed monitoring, metrics, and logging systems on AWS (see the CloudWatch sketch after this list).
  • Compiled data from various sources to perform complex analysis for actionable results.
  • Strong understanding of AWS components such as EC2 and S3.
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline.
  • Defined facts, dimensions and designed the data marts using the Ralph Kimball's Dimensional Data Mart modeling methodology using Erwin.
  • Worked with big data on AWS cloud services such as EC2, S3, EMR, and DynamoDB.
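
As a sketch of the AWS monitoring and metrics work above, the boto3 snippet below publishes a custom metric and defines an alarm on it; the namespace, metric name, and thresholds are illustrative assumptions.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish a custom metric from an ETL run (e.g. rows loaded into Redshift).
    cloudwatch.put_metric_data(
        Namespace="ETL/Redshift",
        MetricData=[{
            "MetricName": "RowsLoaded",
            "Value": 125000,
            "Unit": "Count",
        }],
    )

    # Alarm if no rows are loaded for three consecutive 5-minute periods.
    cloudwatch.put_metric_alarm(
        AlarmName="etl-rows-loaded-low",
        Namespace="ETL/Redshift",
        MetricName="RowsLoaded",
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=3,
        Threshold=1,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=[],  # an SNS topic ARN would normally be listed here
    )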

Environment: AWS, EC2, S3, SQL Server, Erwin, Oracle, Redshift, Informatica, RDS, NoSQL, Snowflake Schema, MySQL, DynamoDB, Docker, PostgreSQL, Tableau, GitHub

Confidential

Data Engineer

Responsibilities:

  • Installed and configured Flume, Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
  • Responsible for coding Java batch and RESTful services, MapReduce programs, and Hive queries, along with testing, debugging, peer code review, troubleshooting, and status reporting.
  • Developed Pig Latin scripts to extract and filter relevant data from the web server output files to load into HDFS.
  • Used AWS S3 to store large amounts of data in a common repository.
  • Developed job workflows in Oozie to automate the tasks of loading data into HDFS and a few other Hive jobs.
  • Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
  • Experience in writing custom MapReduce programs and UDFs in Java to extend Hive and Pig core functionality (a Python streaming sketch follows this list).
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
  • Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Experienced in managing and reviewing Hadoop log files using shell scripts.
  • Developed Flume agents for loading and filtering streaming data into HDFS.
  • Handled continuous streaming data from different sources using Flume, with HDFS set as the destination.
  • Involved in collecting, aggregating, and moving data from servers to HDFS using Flume.
  • Experience in creating various Oozie jobs to manage processing workflows.
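
The custom MapReduce programs and UDFs in this role were written in Java; to keep one language across the sketches in this document, the same map/reduce pattern is shown below as a pair of Hadoop Streaming scripts in Python that count hits per request path from web server logs. The log layout is an assumption.

    # mapper.py -- read web server log lines from stdin, emit (request_path, 1)
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 6:               # skip malformed lines
            print(f"{fields[6]}\t1")      # field 7 of a combined-format log is the path

    # reducer.py -- input arrives sorted by key; sum the counts per request path
    import sys

    current_path, count = None, 0
    for line in sys.stdin:
        path, value = line.rstrip("\n").split("\t")
        if path != current_path:
            if current_path is not None:
                print(f"{current_path}\t{count}")
            current_path, count = path, 0
        count += int(value)
    if current_path is not None:
        print(f"{current_path}\t{count}")

These two files would be submitted with the hadoop-streaming jar as the job's -mapper and -reducer scripts.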

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, AWS, Flume, Oozie, HBase, Sqoop, RDBMS/DB, Flat files, MySQL, Java.

Confidential

Data Analyst

Responsibilities:

  • Created SSIS packages to pull data from SQL Server and exported to Excel Spreadsheets and vice versa.
  • Created new procedures to handle complex business logic and modified existing stored procedures, functions, views, and tables for new enhancements of the project and to resolve existing defects.
  • Automated the extraction of various files, such as flat and Excel files, from sources like FTP and SFTP (Secure FTP); a Python sketch of this kind of pull follows this list.
  • Deployed and scheduled reports using SSRS to generate daily, weekly, monthly, and quarterly reports.
  • Loaded data from various sources such as OLE DB and flat files into the SQL Server database using SSIS packages, and created data mappings to load data from source to destination.
  • Made extensive use of expressions, variables, and row counts in SSIS packages.
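
The file-extraction automation above was built with SSIS; as a hedged, standard-library-only Python sketch of the same kind of FTP pull, the snippet below downloads flat and Excel files into a local staging folder. Host, credentials, directories, and extensions are placeholders (an SFTP variant would use a library such as paramiko).

    import ftplib
    from pathlib import Path

    HOST, USER, PASSWORD = "ftp.example.com", "etl_user", "change-me"   # placeholders
    REMOTE_DIR = "/outbound"
    LOCAL_DIR = Path("C:/etl/incoming")

    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    with ftplib.FTP(HOST, USER, PASSWORD) as ftp:
        ftp.cwd(REMOTE_DIR)
        # Pull down only the flat/Excel files the downstream packages expect.
        for name in ftp.nlst():
            if name.lower().endswith((".csv", ".txt", ".xlsx")):
                with open(LOCAL_DIR / name, "wb") as fh:
                    ftp.retrbinary(f"RETR {name}", fh.write)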

Environment: MS SQL Server, SQL Server Business Intelligence Development Studio, SSIS, SSRS, Report Builder, Office, Excel, Flat Files, T-SQL.
