
Sr Big Data Engineer Resume


Charlotte, NC

SUMMARY

  • Over 8 years of IT industry experience in the analysis, design, implementation, development, maintenance, and testing of large-scale applications using SQL, Hadoop, and other Big Data technologies.
  • Experienced working within the SDLC under both Agile and Waterfall methodologies.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Responsible for wide-ranging data ingestion using Sqoop and HDFS commands; accumulated partitioned data in storage formats such as text, JSON, and Parquet, and loaded data from the Linux file system into HDFS.
  • Analyzed data and provided insights using R and Python (pandas).
  • Expertise with Big Data on AWS cloud services, i.e., EC2, S3, Auto Scaling, Glue, Lambda, CloudWatch, CloudFormation, Athena, DynamoDB, and Redshift.
  • Hands-on experience with tools like Pig and Hive for data analysis, Sqoop for data ingestion, Oozie for scheduling, and Zookeeper for coordinating cluster resources.
  • Experience in developing custom UDFs in Python to extend Hive and Pig Latin functionality (a minimal sketch follows this summary list).
  • Experienced in big data analysis and in developing data models using Hive, Pig, MapReduce, and SQL, with strong data-architecting skills for designing data-centric solutions.
  • Participated in the development, improvement, and maintenance of Snowflake database applications.
  • Extensive experience with real-time streaming technologies: Spark, Storm, and Kafka.
  • Optimized MapReduce programs using combiners, partitioners, and custom counters to deliver the best results.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and the MLlib libraries.
  • Good knowledge of integrating Spark Streaming with Kafka for real time processing of streaming data
  • Expertise in Python and Scala; built user-defined functions (UDFs) for Hive and Pig in Python.
  • Configured Zookeeper, Kafka, and Logstash clusters for data ingestion, tuned Elasticsearch performance, and worked with Kafka for live streaming of data.
  • Designed and implemented large-scale pub-sub message queues using Apache Kafka.
  • Installation, configuration, and administration experience on Big Data platforms: Cloudera Manager (Cloudera) and MCS (MapR).
  • Good experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Experience in importing and exporting data with Sqoop between HDFS and relational database systems, and loading it into partitioned Hive tables.
  • Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
  • Responsible for designing and developing data ingestion from Kroger using Apache NiFi/Kafka
  • Used Zookeeper to provide coordination services to the cluster.
  • Working experience with NoSQL databases such as HBase, Azure, MongoDB, and Cassandra, including their functionality and implementation.
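
Illustrative example: the custom Python UDFs for Hive noted above are typically wired in through Hive's TRANSFORM streaming interface. The sketch below assumes a hypothetical three-column input (user_id, raw_name, city) and an illustrative cleansing rule; it is not the actual production logic.

    # clean_names.py - minimal sketch of a Python "UDF" used via Hive's TRANSFORM
    # streaming interface: Hive pipes tab-separated rows to stdin and reads
    # tab-separated rows back from stdout. Columns here are hypothetical.
    import sys

    for line in sys.stdin:
        try:
            user_id, raw_name, city = line.rstrip("\n").split("\t")
        except ValueError:
            continue  # skip malformed rows instead of failing the whole job
        clean_name = raw_name.strip().title()  # illustrative cleansing rule
        print("\t".join([user_id, clean_name, city.upper()]))

    # Invoked from Hive roughly as follows (table/column names are placeholders):
    #   ADD FILE clean_names.py;
    #   SELECT TRANSFORM(user_id, raw_name, city)
    #   USING 'python clean_names.py'
    #   AS (user_id, clean_name, city)
    #   FROM staging_table;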

TECHNICAL SKILLS

Big Data Technologies: Hadoop, HDFS, Hive, MapReduce, Pig, Sqoop, Flume, Oozie, Hadoop distributions, Kafka, HBase, PySpark, Spark, Airflow, Teradata, Big Data architecture

Programming Languages: Java, Python, Scala

Databases/RDBMS: MySQL, SQL/PL-SQL, MS-SQL Server 2005, Oracle 9i/10g/11g/12c

NoSQL: MongoDB, HBase, Cassandra

Scripting/ Web Languages: JavaScript, HTML5, CSS3, XML, SQL, Shell

ETL Tools: Cassandra, HBase, Elasticsearch

Operating Systems: Linux, Windows

Software Life Cycles: SDLC, Waterfall and Agile models

Office Tools: MS-Office, MS-Project and Risk Analysis tools, Visio

Utilities/Tools: Eclipse, Tomcat, NetBeans, JUnit, SQL, SVN, Log4j, SOAP UI, ANT, Maven, Automation, MR-Unit, Azure Data Lake, Azure Data Lake Analytics, Azure DevOps, Azure Databricks

Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure, Snowflake

PROFESSIONAL EXPERIENCE

Confidential, Charlotte, NC

Sr Big Data Engineer

Responsibilities:

  • Programmatically created CI/CD pipelines in Jenkins using Groovy scripts and Jenkinsfiles, integrating a variety of enterprise tools and testing frameworks into Jenkins for fully automated pipelines that move code from developer workstations all the way to the production environment.
  • Wrote Kafka producers to stream data from external REST APIs to Kafka topics (see the sketch after this list).
  • Involved in creating Hive tables, loading, and analyzing data using Hive scripts; implemented partitioning, dynamic partitions, and buckets in Hive.
  • Used Reporting tools like Tableau to connect with Impala for generating daily reports of data.
  • Wrote Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS.
  • Developed and programmed an ETL pipeline in Python to collect data from the Redshift data warehouse.
  • Developed custom multi-threaded Java based ingestion jobs as well as Sqoop jobs for ingesting from FTP servers and data warehouses.
  • Developed Scala based Spark applications for performing data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
  • Created a program in Python to handle PL/SQL constructs such as cursors and loops, which are not supported by Snowflake.
  • Worked on SSIS, creating all the interfaces between the front-end application and the SQL Server database, and between the legacy database and the SQL Server database and vice versa.
  • Experience with AWS cloud services such as EC2, S3, RDS, ELB, EBS, VPC, Route 53, Auto Scaling groups, CloudWatch, CloudFront, and IAM for build configuration and troubleshooting of server migrations from physical hardware to the cloud on various Amazon images.
  • Worked extensively with AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
  • Worked on troubleshooting Spark applications to make them more error tolerant.
  • Experience integrating Jenkins with tools such as Maven (build), Git (repository), SonarQube (code verification), and Nexus (artifact repository); implemented CI/CD automation by creating Jenkins pipelines programmatically, architecting Jenkins clusters, and scheduling daytime and overnight builds to support development needs.
  • Good hands-on participation in the development and modification of SQL stored procedures, functions, views, indexes, and triggers.
  • Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
  • Experience working with EMR clusters and S3 in the AWS cloud.
  • Migrate data into RV Data Pipeline using Databricks, Spark SQL and Scala.
  • Responsible for ingesting large volumes of IoT data into Kafka.
  • Set up data pipelines in StreamSets to copy data from Oracle to Snowflake.
  • Installed a Docker registry for uploading and downloading Docker images locally and from Docker Hub, and created Dockerfiles to automate the process of capturing and using the images.
  • Worked on fine-tuning spark applications to improve the overall processing time for the pipelines.
  • Streamed real-time data by integrating Kafka with Spark for dynamic price surging using a machine learning algorithm.
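
Illustrative example: a minimal sketch of the REST-to-Kafka producers described in this role, using the kafka-python client and the requests library. The endpoint URL, topic name, broker addresses, and polling interval are placeholders, not actual production values.

    # rest_to_kafka.py - sketch of a producer that polls an external REST API
    # and publishes each record to a Kafka topic (kafka-python client assumed).
    import json
    import time

    import requests
    from kafka import KafkaProducer

    API_URL = "https://example.com/api/events"    # placeholder endpoint
    TOPIC = "external.events.raw"                 # placeholder topic name

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092", "broker2:9092"],  # placeholder brokers
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    while True:
        response = requests.get(API_URL, timeout=30)
        response.raise_for_status()
        for record in response.json():
            # each API record becomes one message on the topic
            producer.send(TOPIC, value=record)
        producer.flush()
        time.sleep(60)  # polling interval is illustrative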

Environment: AWS, Jenkins, EMR, Spark, Hive, S3, Athena, Snowflake, Airflow, Sqoop, Kafka, HBase, Redshift, ETL, Pig, Oozie, Spark Streaming, Hue, Scala, Python, Apache NiFi, Git, Microservices

Confidential, Branchburg, NJ

Big Data Engineer

Responsibilities:

  • Used the SQL Server management tool to check the data in the database against the given requirements.
  • Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using ERWIN and MB MDR.
  • Continuous monitoring and managing of the Hadoop cluster through Cloudera Manager.
  • Scheduled Oozie workflow engine to run multiple HiveQL, Sqoop, and Pig jobs.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
  • Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
  • Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business-rule implementation (see the sketch after this list).
  • Worked with relational SQL and NoSQL databases and tools, including Oracle, Hive, Sqoop, and HBase.
  • Imported and exported data from various sources through scripts and Sqoop.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries, which invoke and run MapReduce jobs in the backend.
  • Created dashboards on the Snowflake cost model and usage in QlikView.
  • Experience working with MapReduce programs using Apache Hadoop for Big Data.
  • Involved in implementing and integrating various NoSQL databases, such as HBase and Cassandra.
  • Experienced in writing live Real-time Processing using Spark Streaming with Kafka.
  • Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python
  • Developed, deployed, and troubleshot ETL workflows using Hive, Pig, and Sqoop.
  • Performed Big Data analysis using Scala, Spark, Spark SQL, Hive, MLlib, and machine learning algorithms.
  • Migrated projects from Cloudera Hadoop Hive storage to Azure Data Lake Store to satisfy Confidential Digital transformation strategy
  • Extracted the needed data from the server into HDFS and bulk loaded the cleaned data into HBase.
  • Worked with Zookeeper, Oozie, and Data Pipeline Operational Services for coordinating the cluster and scheduling workflows.
  • Implemented Spark Kafka streaming to pick up data from Kafka and send it to the Spark pipeline.
  • Responsible for MapR Hadoop cluster administration (installations, upgrades, and configuration).
  • Analyzed data by performing Hive queries and running Pig scripts to study customer behavior.
  • Collaborated with team members and stakeholders in the design and development of the data environment.
  • Wrote Custom Map Reduce Scripts for Data Processing in Java
  • Created Hive tables to store data in HDFS, loaded data, and wrote Hive queries that run internally as MapReduce jobs.
  • Worked with R, SPSS, and Python to develop neural network algorithms and cluster analysis, and used ggplot2 and Shiny in R to understand data and develop applications.
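
Illustrative example: the PySpark/Spark SQL business-rule transformations in Azure Databricks generally follow the pattern sketched below. The mount paths, column names, and rules are illustrative assumptions rather than the actual implementation.

    # Sketch of a Databricks-style PySpark transformation; paths, columns, and
    # business rules are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("business-rules").getOrCreate()

    orders = spark.read.parquet("/mnt/raw/orders")  # placeholder mount path

    enriched = (
        orders
        .filter(F.col("status") != "CANCELLED")      # drop cancelled orders
        .withColumn("order_date", F.to_date("order_ts"))
        .withColumn(
            "tier",
            F.when(F.col("amount") >= 1000, "GOLD").otherwise("STANDARD"),
        )
    )

    # The same rules exposed through Spark SQL for downstream consumers
    enriched.createOrReplaceTempView("orders_enriched")
    daily = spark.sql("""
        SELECT order_date, tier, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM orders_enriched
        GROUP BY order_date, tier
    """)

    daily.write.mode("overwrite").partitionBy("order_date").parquet("/mnt/curated/daily_orders")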

Environment: Hadoop, HDFS, Kafka, Azure, Databricks, Data Factory, MapReduce, Snowflake, Scala, Python, Spark, Hive, HBase, Pig, Zookeeper, Oozie, Sqoop, PL/SQL, Oracle, MongoDB, R, Windows.

Confidential, Atlanta, GA

Data Engineer

Responsibilities:

  • Built the Oozie pipeline, which performs several actions such as file-move processing, Sqooping data from the source Teradata or SQL systems, exporting it into Hive staging tables, performing aggregations per business requirements, and loading the results into the main tables.
  • Ran the Apache Hadoop, CDH, and MapR distributions, dubbed Elastic MapReduce (EMR), on EC2.
  • Performed forking whenever there was scope for parallel processing, to optimize data latency.
  • Applied Apache Kafka to transform live streaming with the batch processing to generate reports.
  • Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
  • Experience configuring Zookeeper to coordinate the servers in clusters and to maintain the data consistency that is important for decision-making in the process.
  • Worked with different data formats such as JSON and XML, and applied machine learning algorithms in Python.
  • Developed Python code for tasks, dependencies, an SLA watcher, and a time sensor for each job, for workflow management and automation using Airflow (see the DAG sketch after this list).
  • Wrote a Pig script that picks up data from one HDFS path, performs aggregations, and loads the results into another path that later populates another domain table; packaged this script as a jar and passed it as a parameter in the Oozie script.
  • Application development using Hadoop ecosystem components such as Spark, Kafka, HDFS, Hive, Oozie, and Sqoop.
  • Involved in support for Amazon AWS and RDS to host static/media files and the database in the Amazon cloud.
  • Automated RabbitMQ cluster installations and configuration using Python/Bash.
  • Analyzed data by performing Hive queries and running Pig scripts to study customer behavior.
  • Experience in job workflow scheduling and monitoring tools like Oozie, and good knowledge of Zookeeper for coordinating the servers in clusters and maintaining data consistency.
  • Created a logical data model from the conceptual model and converted it into the physical database design using Erwin; involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
  • Designed and Developed Real Time Data Ingestion frameworks to fetch data from Kafka to Hadoop.
  • Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that we could assign each document a response label for further classification.
  • Used partitioning and bucketing in Hive to optimize queries.
  • Implemented real time system with Kafka and Zookeeper.
  • Built performant, scalable ETL processes to load, cleanse and validate data
  • Used the Python library Beautiful Soup for web scraping to extract data for building graphs.
  • Continuous monitoring and managing of the Hadoop cluster through Cloudera Manager.
  • Scheduled and executed workflows in Oozie to run Hive and Pig jobs.
  • Provided concurrent access to Hive tables with shared and exclusive locking by using the Zookeeper implementation in the cluster.
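
Illustrative example: a minimal sketch of the Airflow automation described in this role (task dependencies plus a per-task SLA), written against the Airflow 2.x API. The DAG id, schedule, callables, and SLA values are assumptions for illustration only.

    # Sketch of an Airflow 2.x DAG with task dependencies and a per-task SLA.
    # DAG id, schedule, and callables are placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from the source system")   # placeholder extract step

    def load():
        print("load data into the warehouse")       # placeholder load step

    with DAG(
        dag_id="daily_ingest",
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 2 * * *",               # run daily at 02:00
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(
            task_id="load",
            python_callable=load,
            sla=timedelta(hours=1),                  # alert if the task runs past its SLA
        )
        extract_task >> load_task                    # load runs only after extract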

Environment: Hadoop, HDFS, MapReduce, Hive, Kafka, Spark, AWS, Airflow, Python, ETL workflows, Scala, PL/SQL, SQL Server, Tableau, Pig

Confidential

Hadoop Developer

Responsibilities:

  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems.
  • Developed a suite of unit test cases for Mapper, Reducer, and Driver classes using a testing library.
  • Collected log data from web servers and integrated it into HDFS using Flume.
  • Worked on developing ETL workflows on the data obtained, using Scala for processing it in HDFS and HBase with Oozie.
  • Wrote ETL jobs to visualize the data and generate reports from the MySQL database using DataStage.
  • Developed Python scripts to find vulnerabilities in SQL queries by testing for SQL injection.
  • Loaded data from the UNIX file system to HDFS and wrote Hive user-defined functions (a loading sketch follows this list).
  • Used Sqoop to load data from DB2 to HBase for faster querying and performance optimization.
  • Worked on streaming to collect this data from Flume and performed real-time batch processing.
  • Developed Hive scripts for implementing dynamic partitions.
  • Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
  • Experience in writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
  • Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
  • Developed Sqoop scripts for loading data into HDFS from DB2 and pre-processed it with Pig.
  • Automated the tasks of loading the data into HDFS and pre-processing with Pig by developing workflows using Oozie
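
Illustrative example: loading files from the UNIX/Linux file system into HDFS, as in the bullets above, can be wrapped in a small Python helper around the standard hdfs dfs CLI. The local and HDFS paths below are placeholders.

    # Sketch of a Python wrapper that pushes local files into HDFS by shelling
    # out to the standard `hdfs dfs` CLI. Paths are placeholders.
    import glob
    import subprocess

    LOCAL_GLOB = "/data/incoming/*.log"    # placeholder local path
    HDFS_DIR = "/user/etl/raw/logs"        # placeholder HDFS target directory

    # ensure the target directory exists in HDFS
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)

    for path in glob.glob(LOCAL_GLOB):
        # -f overwrites an existing file of the same name in HDFS
        subprocess.run(["hdfs", "dfs", "-put", "-f", path, HDFS_DIR], check=True)
        print(f"loaded {path} -> {HDFS_DIR}")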

Environment: Hadoop, HDFS, Hive, Pig, Flume, MapReduce, ETL Workflows, HBase, Python, Sqoop, Oozie, DataStage, Linux, Relational Databases, SQL Server, DB2

Confidential

Data Stage Developer

Responsibilities:

  • Designed DataStage parallel jobs using the Designer to extract data from various source systems, transform and convert the data, load it into the data warehouse, and send data from the warehouse to third-party systems such as the mainframe.
  • Developed various server and parallel jobs using the Oracle, ODBC, FTP, Peek, Aggregator, Filter, Funnel, Copy, Hash File, Change Capture, Merge, Lookup, Join, and Sort stages.
  • Converted the design into well-structured and high quality DataStage jobs.
  • Developed PL/SQL Procedures, Functions, Packages, Triggers, Normal and Materialized Views.
  • Performed ETL Performance tuning to increase the ETL process speed.
  • Used the DataStage Designer to develop processes for Extracting, Cleansing, Transforming, Integrating, and Loading data into Datawarehouse.

Environment: DataStage, Toad for Oracle, Python, PL/SQL, SQL Server, Git, Windows
