
Senior Big Data Engineer Resume


Dearborn, MI

SUMMARY

  • 8+ years of experience as a Big Data/Hadoop developer with varying levels of expertise across the Big Data/Hadoop ecosystem, including Spark Streaming, HDFS, MapReduce, NiFi, Hive, HBase, Storm, Kafka, Flume, Sqoop, Zookeeper, and Oozie.
  • Expertise in Business Intelligence, Data warehousing technologies, ETL and Big Data technologies.
  • Experience in creating ETL mappings using Informatica to move data from multiple sources, such as flat files and Oracle, into a common target area such as a data warehouse.
  • Experience in writing PL/SQL statements - stored procedures, functions, triggers, and packages.
  • Involved in creating database objects like tables, views, procedures, triggers, and functions using T-SQL to provide definition, structure and to maintain data efficiently.
  • Proficient in data analysis, cleansing, transformation, data migration, data integration, data import, and data export through the use of ETL tools such as Informatica.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Experienced in using distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, Elasticsearch), Hadoop, Python, and Spark, with effective use of MapReduce, SQL, and Cassandra to solve big data problems.
  • Cloudera certified developer for Apache Hadoop. Good knowledge of Cassandra, Hive, Pig, HDFS, Sqoop and Map Reduce.
  • Experience in importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables (see the sketch at the end of this summary).
  • Excellent implementation knowledge of Enterprise/Web/Client Server using Java, J2EE.
  • Expertise in working with Linux/Unix and shell commands on the Terminal.
  • Extensively used microservices and Postman for calling REST endpoints on Hadoop clusters.
  • Expertise with Python, Scala, and Java in designing, developing, administering, and supporting large-scale distributed systems.
  • Skilled in Tableau Desktop 10.x for data visualization, reporting, and analysis.
  • Developed reports, dashboards using Tableau for quick reviews to be presented to Business and IT users.
  • Extensive knowledge in various reporting objects like Facts, Attributes, Hierarchies, Transformations, filters, prompts, Calculated fields, Sets, Groups, Parameters etc., in Tableau.
  • Hands-on experience with different ETL tools to shape data so it can be connected to Tableau through Tableau Data Extracts.
  • Analyzed data and provided insights with R programming and Python Pandas.
  • Experience in working with SAS Enterprise Guide Software for reporting and analytical tasks.
  • Experience in utilizing SAS procedures, macros, and other SAS applications for data extraction using Oracle and Teradata.
  • Expertise in writing complex SQL queries, making use of indexing, aggregation, and materialized views to optimize query performance.
  • Experience with segmentation analysis, regression models, and clustering.
  • Experience in collecting log and JSON data into HDFS using Flume and processing the data using Hive/Pig.
  • Migration of on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF v1/v2).
  • Experience working with Hortonworks and Cloudera environments.
  • Good knowledge in implementing various data processing techniques using Apache HBase for handling the data and formatting it as required.
  • Experience in using build/deploy tools such as Jenkins, Docker, and OpenShift for Continuous Integration & Deployment of microservices.
  • Experience in designing star schema, Snowflake schema for Data Warehouse, ODS architecture.
  • Strong experience in core Java, Scala, SQL, PL/SQL, and RESTful web services.
  • Good knowledge of Data Marts, OLAP, Dimensional Data Modeling with Ralph Kimball Methodology (Star Schema Modeling, Snow-Flake Modeling for FACT and Dimensions Tables) using Analysis Services.
  • Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying on public or private cloud.
  • Experienced in building automated regression scripts for validation of ETL processes between multiple databases like Oracle, SQL Server, Hive, and MongoDB using Python.
  • Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Expert in building enterprise data warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.
  • Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
  • Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.
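
The Sqoop-to-Hive load pattern mentioned above can be illustrated with a short Scala sketch. This is a minimal, hypothetical example that uses Spark's JDBC reader in place of the Sqoop CLI and assumes an Oracle JDBC driver on the classpath; the connection URL, credentials, and table names are placeholders rather than details from the projects described here.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object OracleToHive {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("oracle-to-hive")
          .enableHiveSupport()          // allows writing managed Hive tables
          .getOrCreate()

        // Pull a source table over JDBC (placeholder URL, credentials, and table)
        val orders = spark.read
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
          .option("dbtable", "SALES.ORDERS")
          .option("user", sys.env("DB_USER"))
          .option("password", sys.env("DB_PASSWORD"))
          .load()

        // Land the data as a Hive table partitioned by order date
        orders.write
          .mode(SaveMode.Overwrite)
          .partitionBy("order_date")
          .saveAsTable("staging.orders")
      }
    }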

TECHNICAL SKILLS

Big Data: HDFS, Map Reduce, Spark, Airflow, Yarn, NiFi, HBase, Hive, Pig, Flume, Sqoop, Kafka, Oozie, Hadoop, Zookeeper, Spark SQL.

Languages: SQL, PL/SQL, Python, Java, Scala, C, HTML, Unix/Linux shell scripting

Operating Systems: UNIX, Windows, Linux

Cloud Platform: AWS (Amazon Web Services), Microsoft Azure

RDBMS: Oracle 10g/11g/12c, Teradata, MySQL, MS SQL Server

NoSQL: MongoDB, HBase, Cassandra

Concepts and Methods: Business Intelligence, Data Warehousing, Data Modeling, Requirement Analysis

Application Servers: Apache Tomcat, WebSphere, WebLogic, JBoss

Data Modeling Tools: ERwin, Power Designer, Embarcadero ER Studio, IBM Rational Software Architect, MS Visio, Star Schema Modeling, Snowflake Schema Modeling, Fact and Dimension Tables

ETL Tools: Matillion ETL for AWS Redshift, Alteryx, Informatica PowerCenter, Ab Initio

Other Tools: Azure Databricks, Azure Data Explorer, Azure HDInsight, Power BI

PROFESSIONAL EXPERIENCE

Confidential, Dearborn, MI

Senior Big Data Engineer

Responsibilities:

  • Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
  • Developed Scala scripts using both the DataFrame/Dataset/SQL and RDD/MapReduce APIs in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
  • Developed Hive queries to pre-process the data required for running the business process
  • Developed a Spark job in Java which indexes data into ElasticSearch from external Hive tables which are in HDFS.
  • Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of Spark with Scala.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS (see the sketch after this list).
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
  • Configured Spark Streaming to receive ongoing information from Kafka and store the streamed data in HDFS.
  • Used Spark and Spark SQL to read the Parquet data and create the tables in Hive using the Scala API.
  • Experienced in using the Spark application master to monitor Spark jobs and capture their logs.
  • Supported continuous storage in AWS using Elastic Block Store, S3, and Glacier; created volumes and configured snapshots for EC2 instances.
  • Used the DataFrame API in Scala to work with distributed collections of data organized into named columns, developing predictive analytics using the Apache Spark Scala APIs.
  • Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
  • Involved in Cassandra cluster planning, with a good understanding of Cassandra cluster mechanisms, including replication strategies, snitches, gossip, consistent hashing, and consistency levels.
  • Responsible for developing Spark-Cassandra connector jobs to load data from flat files into Cassandra for analysis; modified the cassandra.yaml and cassandra-env.sh files to set various configuration properties.
  • Used Sqoop to import data from different relational databases like Oracle and MySQL, designed column families in Cassandra, performed data transformations, and exported the transformed data to Cassandra as per the business requirements.
  • Developed efficient MapReduce programs for filtering out unstructured data and developed multiple MapReduce jobs to perform data cleaning and pre-processing on Hortonworks.
  • Used Impala and Presto for querying the datasets.
  • Developed multiple Kafka producers and consumers as per the software requirement specifications.
  • Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that gets data from Kafka in near real time and persists it to Cassandra.
  • Experience in building real-time data pipelines with Kafka Connect and Spark Streaming.
  • Used Kafka and Kafka brokers, initiated the Spark context, processed live streaming information with RDDs, and used Kafka to load data into HDFS and NoSQL databases.
  • Developed code in Java that creates mappings in Elasticsearch before data is indexed into it.
  • Developed end-to-end data processing pipelines that begin with receiving data via the distributed messaging system Kafka and persist it into Cassandra.
  • Implemented a data interface to get customer information using a REST API, pre-processed the data using MapReduce 2.0, and stored it in HDFS (Hortonworks).
  • Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Worked on the Hortonworks HDP distribution.
  • Used Hortonworks Apache Falcon for data management and pipeline processing in the Hadoop cluster.
  • Used Impala to query data in the publish layers so that other teams and business users could access it for faster processing.
  • Maintained ELK (Elasticsearch, Logstash, Kibana) and wrote Spark scripts using the Scala shell.
  • Worked in an AWS environment for development and deployment of custom Hadoop applications.
  • Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon AWS EC2 instances.
  • Developed shell scripts to generate the Hive CREATE statements from the data and load data into the tables.
  • Involved in writing custom MapReduce programs using the Java API for data processing.
  • Created Hive tables as per requirements, as internal or external tables defined with appropriate static/dynamic partitions and bucketing for efficiency.
  • Developed Hive queries for analysts by loading and transforming large sets of structured and semi-structured data.
  • Developed automated regression scripts in Python for validation of ETL processes between multiple databases, including AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
  • Worked on Apache NiFi: executing Spark and Sqoop scripts through NiFi, creating scatter-and-gather patterns, ingesting data from Postgres to HDFS, fetching Hive metadata and storing it in HDFS, and creating a custom NiFi processor for filtering text from flow files.
  • Provided cluster coordination services through Zookeeper.
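
As referenced in the Kafka/Spark Streaming bullets above, here is a minimal Scala sketch of a Kafka-to-HDFS Parquet flow. It uses Spark Structured Streaming rather than the DStream API used on the project and assumes the spark-sql-kafka connector package is available; the broker addresses, topic, and paths are placeholders.

    import org.apache.spark.sql.SparkSession

    object KafkaToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-to-parquet").getOrCreate()

        // Subscribe to the Kafka topic (placeholder broker and topic names)
        val feed = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "learner-events")
          .load()
          .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

        // Persist the stream to HDFS as Parquet; the checkpoint makes the job restartable
        feed.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/raw/learner_events")
          .option("checkpointLocation", "hdfs:///checkpoints/learner_events")
          .start()
          .awaitTermination()
      }
    }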

Environment: HDP, Hadoop, AWS, EC2, S3, Redshift, Cassandra, Hive, HDFS, Spark, Spark SQL, Spark Streaming, Scala, Kafka, Hortonworks, MapReduce, Apache NiFi, Impala, Zookeeper, ELK, Sqoop, Java, Oracle 12c, SQL Server, T-SQL, MongoDB, HBase, Python, and Agile methodologies.

Confidential, Nashville, TN

Senior Big Data Engineer

Responsibilities:

  • Implemented a generic, highly available ETL framework for bringing related data into Hadoop and Cassandra from various sources using Spark.
  • Involved in creating Hive Tables, loading with data and writing Hive queries which will invoke and run MapReduce jobs in the backend.
  • Involved in creating notebooks for moving data from the raw zone to the stage and curated zones using Azure Databricks.
  • Used SQL Azure extensively for database needs in various applications.
  • Created multiple dashboards in tableau for multiple business needs.
  • Installed and configured Hive, wrote Hive UDFs, and used Piggybank, a repository of UDFs, for Pig Latin.
  • Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them (see the sketch after this list).
  • Participated in various upgrade and troubleshooting activities across the enterprise.
  • Knowledge in performance troubleshooting and tuning of Hadoop clusters.
  • Worked on data cleaning and reshaping, and generated segmented subsets using NumPy and Pandas in Python.
  • Deployed a Windows Kubernetes (K8s) cluster with Azure Container Service (ACS) from the Azure CLI, and utilized Kubernetes and Docker as the runtime environment of the CI/CD system to build, test, and deploy with Octopus Deploy.
  • Experience in building data pipelines using Azure Data Factory and Azure Databricks, loading data to Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
  • Migrated complex MapReduce programs into in-memory Spark processing using transformations and actions.
  • Used Avro serializers and deserializers for developing the Kafka clients.
  • Developed Kafka Consumer job to consume data from Kafka topic and perform validations on the data before pushing data into Hive and Cassandra databases.
  • Applied Spark Streaming for real-time data transformation.
  • Experienced in using Platfora, a data visualization tool specific to Hadoop, and created various lenses and vizboards for real-time visualization from Hive tables.
  • Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
  • Built pipelines to copy data from source to destination in Azure Data Factory.
  • Queried and analyzed data from Cassandra for quick searching, sorting, and grouping through CQL.
  • Implemented various data modeling techniques for Cassandra.
  • Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data; created various types of data visualizations using Python and Tableau.
  • Created and deployed Azure command-line scripts to automate tasks.
  • Worked with Spark using Scala and Spark SQL for faster testing and processing of data.
  • Applied advanced Spark procedures like text analytics and processing using in-memory processing.
  • Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store it in Hadoop.
  • Created an architecture stack blueprint for data access with the NoSQL database Cassandra.
  • Brought data from various sources into Hadoop and Cassandra using Kafka.
  • Experienced in using Tidal Enterprise Scheduler and Oozie operational services for coordinating the cluster and scheduling workflows.
  • Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
  • Devised and led the implementation of a next-generation architecture for more efficient data ingestion and processing.
  • Created and implemented various shell scripts for automating jobs.
  • Implemented Apache Sentry to restrict access to the Hive tables at a group level.
  • Employed the Avro format for the entire data ingestion for faster operation and less space utilization.
  • Experienced in managing and reviewing Hadoop log files.
  • Worked in an Agile environment and used Rally to maintain user stories and tasks.
  • Worked with enterprise data support teams to install Hadoop updates, patches, and version upgrades as required, and fixed problems that arose after the upgrades.
  • Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.
  • Implemented Composite Server for data virtualization needs and created multiple views for restricted data access using a REST API.
  • Implemented test scripts to support test-driven development and continuous integration.
  • Used Spark for Parallel data processing and better performances.
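
A minimal Scala sketch of the Cassandra join analytics referenced above. It assumes the DataStax spark-cassandra-connector is on the classpath; the keyspace, table, and column names are hypothetical, and the target table is assumed to already exist in Cassandra.

    import org.apache.spark.sql.SparkSession

    object CassandraJoinAnalytics {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cassandra-join-analytics").getOrCreate()

        // Helper to read one Cassandra table as a DataFrame (placeholder keyspace)
        def cassandraTable(table: String) =
          spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(Map("keyspace" -> "retail", "table" -> table))
            .load()

        val orders    = cassandraTable("orders")
        val customers = cassandraTable("customers")

        // Join the two tables and compute order counts per customer segment
        val ordersPerSegment = orders
          .join(customers, "customer_id")
          .groupBy("segment")
          .count()

        // Write the aggregate back to a reporting table in Cassandra
        ordersPerSegment.write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "retail", "table" -> "orders_per_segment"))
          .mode("append")
          .save()
      }
    }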

Environment: MapR, MapReduce, Spark, Scala, Solr, Java, Azure SQL, Azure Databricks, Azure Data Lake, HDFS, Hive, Pig, Impala, Cassandra, Python, Kafka, Tableau, Teradata, CentOS, Pentaho, Zookeeper, Sqoop.

Confidential, Madison, WI

Big Data Engineer

Responsibilities:

  • Strong understanding of AWS components such as EC2 and S3
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto-scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
  • Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin.
  • Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
  • Analyzed the existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
  • Involved in forward engineering of the logical models to generate the physical models using Erwin, and their subsequent deployment to the Enterprise Data Warehouse.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Created various complex SSIS/ETL packages to Extract, Transform and Load data
  • Advanced knowledge on Confidential Redshift and MPP database concepts.
  • Migrated on-premises database structures to the Confidential Redshift data warehouse.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Connected to Amazon Redshift through Tableau to extract live data for real time analysis.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
  • Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
  • Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
  • Implemented and managed ETL solutions and automated operational processes.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Defined facts, dimensions and designed the data marts using the Ralph Kimball's Dimensional Data Mart modeling methodology using Erwin.
  • Worked on big data with AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB.
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done (see the sketch after this list).
  • Compiled data from various sources to perform complex analysis for actionable results
  • Measured Efficiency of Hadoop/Hive environment ensuring SLA is met
  • Optimized the TensorFlow Model for efficiency
  • Analyzed the system for new enhancements/functionalities and perform Impact analysis of the application for implementing ETL changes
  • Implemented a continuous delivery pipeline with Docker, GitHub, and AWS.
  • Built performant, scalable ETL processes to load, cleanse and validate data
  • Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
  • Collaborate with team members and stakeholders in design and development of data environment
  • Preparing associated documentation for specifications, requirements, and testing
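
A minimal Scala sketch of the kind of Spark SQL ETL into Redshift referenced above. It assumes the Amazon Redshift JDBC driver is on the classpath and writes over plain JDBC; the actual work may have used UNLOAD/COPY or Informatica instead, and the cluster endpoint, credentials, and table names are placeholders.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object HiveToRedshift {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-to-redshift")
          .enableHiveSupport()
          .getOrCreate()

        // Shape the data with Spark SQL against an existing Hive table (placeholder names)
        val dailySales = spark.sql(
          """SELECT store_id, sale_date, SUM(amount) AS total_amount
            |FROM warehouse.sales
            |GROUP BY store_id, sale_date""".stripMargin)

        // Load the aggregate into Redshift over JDBC (placeholder endpoint and credentials)
        dailySales.write
          .format("jdbc")
          .option("url", "jdbc:redshift://example-cluster:5439/analytics")
          .option("dbtable", "public.daily_sales")
          .option("user", sys.env("REDSHIFT_USER"))
          .option("password", sys.env("REDSHIFT_PASSWORD"))
          .option("driver", "com.amazon.redshift.jdbc42.Driver")  // adjust to the driver in use
          .mode(SaveMode.Append)
          .save()
      }
    }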

Environment: AWS, EC2, S3, SQL Server, Erwin, Oracle, Redshift, Informatica, RDS, NoSQL, Snowflake Schema, MySQL, DynamoDB, Docker, PostgreSQL, Tableau, GitHub.

Confidential, Philadelphia, PA

Data Engineer

Responsibilities:

  • Experience in building and architecting multiple data pipelines, including end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
  • Strong understanding of AWS components such as EC2 and S3
  • Created Hive, Phoenix, and HBase tables, as well as HBase-integrated Hive tables, as per the design using the ORC file format and Snappy compression (see the sketch after this list).
  • Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
  • Used Hive to implement data warehouse and stored data into HDFS. Stored data into Hadoop clusters which are set up in AWS EMR.
  • Responsible for coding Java batch jobs, RESTful services, MapReduce programs, and Hive queries, as well as testing, debugging, peer code review, troubleshooting, and maintaining status reports.
  • Performed Data Preparation by using Pig Latin to get the right data format needed.
  • Used PCA to reduce dimensionality and compute eigenvalues and eigenvectors, and used OpenCV to analyze CT scan images to identify disease.
  • Used Kafka and Kafka brokers, initiated the Spark context, processed live streaming information with RDDs, and used Kafka to load data into HDFS and NoSQL databases.
  • Processed the image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.
  • Implemented a continuous delivery pipeline with Docker and GitHub.
  • Experience in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
  • Created Session Beans and controller Servlets for handling HTTP requests from Talend
  • Used Git for version control with Data Engineer team and Data Scientists colleagues.
  • Performed data engineering functions: data extract, transformation, loading, and integration in support of enterprise data infrastructures - data warehouse, operational data stores and master data management
  • Implemented partitioning, dynamic partitions, and buckets in Hive.
  • Develop database management systems for easy access, storage, and retrieval of data.
  • Perform DB activities such as indexing, performance tuning, and backup and restore.
  • Expertise in writing Hadoop Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom Map Reduce programs in Java.
  • Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling; skilled in data visualization with libraries like Matplotlib and Seaborn.
  • Hands on experience with big data tools like Hadoop, Spark, Hive
  • Experience implementing machine learning back-end pipelines with Pandas and NumPy.
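
A minimal Scala sketch of writing an ORC-backed, Snappy-compressed Hive table as referenced above; the input path, partition column, and table names are placeholders rather than project specifics.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object OrcSnappyLoad {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("orc-snappy-load")
          .enableHiveSupport()
          .getOrCreate()

        // Read the landed raw files (placeholder path and format)
        val visits = spark.read
          .option("header", "true")
          .csv("hdfs:///data/landing/visits")

        // Store as an ORC-backed Hive table with Snappy compression, partitioned by day
        visits.write
          .format("orc")
          .option("compression", "snappy")
          .partitionBy("visit_date")
          .mode(SaveMode.Overwrite)
          .saveAsTable("curated.visits")
      }
    }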

Environment: HBase, Cloud Shell, Docker, Kubernetes, AWS, Apache Airflow, Python, Pandas, Matplotlib, Seaborn, NumPy, ETL workflows, Kafka, Scala, Spark.

Confidential

Hadoop Developer

Responsibilities:

  • Installed and configured Pig and wrote Pig Latin scripts.
  • Designed and implemented MapReduce-based large-scale parallel relation-learning system
  • Setup and benchmarked Hadoop/HBase clusters for internal use
  • Created batch jobs and configuration files to create automated processes using SSIS.
  • Created SSIS packages to pull data from SQL Server and export it to Excel spreadsheets, and vice versa.
  • Built SSIS packages to fetch files from remote locations like FTP and SFTP, decrypt them, transform them, load them into the data warehouse/data marts, and provide proper error handling and alerting.
  • Wrote MapReduce jobs using Pig Latin; involved in ETL, data integration, and migration.
  • Created new procedures to handle complex logic for business and modified already existing stored procedures, functions, views and tables for new enhancements of the project and to resolve the existing defects.
  • Loaded data from various sources, such as OLE DB and flat files, into the SQL Server database using SSIS packages, and created data mappings to load the data from source to destination.
  • Created Hive tables and worked on them using HiveQL; experienced in defining job flows.
  • Imported and exported data between HDFS and the Oracle database using Sqoop.
  • Involved in creating Hive tables, loading data, and writing Hive queries that run internally as MapReduce jobs; developed a custom file system plugin for Hadoop so it can access files on the data platform.
  • The custom File System plugin allows Hadoop MapReduce programs, HBase, Pig and Hive to work unmodified and access files directly.
  • Worked on different data flow and control flow tasks, For Loop containers, Sequence containers, Script tasks, Execute SQL tasks, and package configuration.
  • Involved in review of functional and non-functional requirements.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and pre-processing (see the sketch after this list).
  • Extensive use of Expressions, Variables, Row Count in SSIS packages
  • Data validation and cleansing of staged input records was performed before loading into Data Warehouse
  • Automated the process of extracting the various files like flat/excel files from various sources like FTP and SFTP (Secure FTP).
  • Deployed and scheduled SSRS reports to generate daily, weekly, monthly, and quarterly reports.
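
A sketch of the kind of map-only data-cleansing MapReduce job referenced above. The original jobs were written in Java; this illustrative version uses Scala against the Hadoop MapReduce API, and the delimiter and expected field count are assumptions for the example.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Map-only job that drops malformed delimited records during pre-processing.
    class CleansingMapper extends Mapper[LongWritable, Text, NullWritable, Text] {
      private val ExpectedFields = 12   // assumed record width for the illustrative input

      override def map(key: LongWritable, value: Text,
                       context: Mapper[LongWritable, Text, NullWritable, Text]#Context): Unit = {
        // Keep only rows with the expected number of pipe-delimited fields
        if (value.toString.split("\\|", -1).length == ExpectedFields) {
          context.write(NullWritable.get(), value)
        }
      }
    }

    object CleanseJob {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "record-cleansing")
        job.setJarByClass(classOf[CleansingMapper])
        job.setMapperClass(classOf[CleansingMapper])
        job.setNumReduceTasks(0)                      // map-only: no aggregation needed
        job.setOutputKeyClass(classOf[NullWritable])
        job.setOutputValueClass(classOf[Text])
        FileInputFormat.addInputPath(job, new Path(args(0)))    // raw input directory
        FileOutputFormat.setOutputPath(job, new Path(args(1)))  // cleansed output directory
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }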

Environment: Hadoop, MapReduce, Pig, MS SQL Server, SQL Server Business Intelligence Development Studio, Hive, HBase, SSIS, SSRS, Report Builder, Office, Excel, Flat Files, .NET, T-SQL.
