Sr. Data Engineer/ Big Data Engineer Resume
Westlake, TX
SUMMARY
- 8+ years of experience as a Big Data developer, designing and implementing enterprise data warehouse, business intelligence, analytical, and batch/real-time/near-real-time streaming big data solutions.
- Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
- Expertise in major components of the Hadoop ecosystem, including HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
- Good understanding of distributed systems, HDFS architecture, internal working details of MapReduce and Spark processing frameworks.
- Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
- Developed dataset processes for data modeling and data mining; recommended ways to improve data reliability, efficiency, and quality.
- Experience in importing and exporting data with Sqoop between HDFS and relational database systems, and loading it into partitioned Hive tables.
- Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
- Extensive knowledge of writing Hadoop jobs for data analysis per business requirements using Hive; wrote HiveQL queries for data extraction, join operations, and custom UDFs, with good experience optimizing Hive queries.
- Solid experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Worked with various file formats such as delimited text, clickstream logs, Apache logs, Avro, JSON, and XML; proficient with columnar formats such as RCFile, ORC, and Parquet; good understanding of compression techniques used in Hadoop processing, such as gzip, Snappy, and LZO.
- Experienced in running queries using Impala and using BI tools to run ad-hoc queries directly on Hadoop.
- Working experience with NoSQL databases such as HBase, MongoDB, Cassandra, and Azure-based NoSQL stores, including their functionality and implementation.
- Experience with Databricks MLflow for running machine learning models on distributed platforms.
- Experience in extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
- Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
- Implemented an HBase cluster as part of a POC to address HBase limitations.
- Strong knowledge of Spark architecture and components; efficient in working with Spark Core and Spark SQL.
- Good knowledge of Scala's functional programming techniques, such as anonymous functions (closures), higher-order functions, and pattern matching.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala (an illustrative PySpark sketch follows this summary).
- Experience in using Kafka brokers with Spark contexts to process live streaming data.
- Developed custom Kafka producers and consumers for publishing to and subscribing from Kafka topics.
- Good working experience on Spark (spark streaming, spark SQL) with Scala and Kafka. Worked on reading multiple data formats on HDFS using Scala.
- Worked on Spark SQL, creating DataFrames by loading data from Hive tables, preparing the data, and storing it in AWS S3.
- Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data and Spark DataFrame operations to perform required data validations.
- Extensive knowledge of RDBMS such as Oracle, Microsoft SQL Server, and MySQL, as well as DevOps practices.
- Extensive experience working on various databases and database script development using SQL and PL/SQL.
- Excellent understanding and knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.
- Good understanding and knowledge of databases such as MongoDB, PostgreSQL, HBase, Cassandra, and Azure data stores.
- Experienced in working with Amazon Web Services (AWS) using EC2 for computing and S3 as storage mechanism.
- Capable of using AWS services such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Hands on experience in using other Amazon Web Services like Autoscaling, RedShift, DynamoDB, Route53.
- Experience with operating systems: Linux, RedHat, and UNIX.
- Experience in complete project life cycle (design, development, testing and implementation) of Client Server and Web applications.
- Excellent programming skills with experience in Java, C, SQL and Python Programming.
- Worked in various programming languages using IDEs such as Eclipse, NetBeans, and IntelliJ, along with tools like PuTTY and Git.
- Experienced in working in SDLC, Agile and Waterfall Methodologies.
- Excellent experience in designing and developing Enterprise Applications for J2EE platform using Servlets, JSP, Struts, Spring, Hibernate and Web services.
- Very strong interpersonal skills and the ability to work both independently and in a group; learn quickly and adapt easily to the working environment.
- Good exposure to interacting with clients and resolving application environment issues; communicate effectively with people at different levels, including stakeholders, internal teams, and senior management.
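The summary above mentions converting Hive/SQL queries into Spark DataFrame transformations and writing to partitioned Hive tables. The following is a minimal, illustrative PySpark sketch of that pattern; the table and column names are placeholders, not values from any project described here.

```python
# Illustrative sketch: rewrite a HiveQL aggregation as Spark DataFrame transformations.
# Hypothetical tables/columns: sales.transactions(region, amount, year) -> sales.region_totals.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-dataframe")
    .enableHiveSupport()          # needed to read/write Hive tables
    .getOrCreate()
)

# Original HiveQL, for reference:
#   SELECT region, SUM(amount) AS total_sales
#   FROM sales.transactions WHERE year = 2020 GROUP BY region;
sales = spark.table("sales.transactions")          # placeholder Hive table
totals = (
    sales.filter(F.col("year") == 2020)            # filter that can prune partitions
         .groupBy("region")
         .agg(F.sum("amount").alias("total_sales"))
)

# Write the result back as a partitioned Hive table, as described in the summary.
(
    totals.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("region")
    .saveAsTable("sales.region_totals")            # placeholder output table
)
```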
TECHNICAL SKILLS
Big Data Technologies: Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper
Hadoop Distribution: Cloudera CDH, Apache, AWS, Hortonworks HDP
Programming Languages: SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Shell Scripting, Regular Expressions
Spark components: RDD, Spark SQL (Data Frames and Dataset), and Spark Streaming
Cloud Infrastructure: AWS, Azure, GCP
Databases: Oracle, Teradata, MySQL, SQL Server, NoSQL databases (HBase, MongoDB)
Scripting & Query Languages: Shell scripting, SQL
Version Control: CVS, SVN, ClearCase, Git
Build Tools: Maven, SBT
Containerization Tools: Kubernetes, Docker, Docker Swarm
Reporting & Development Tools: JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD, Linux/UNIX, Google Cloud Shell, Power BI, SAS, Tableau
PROFESSIONAL EXPERIENCE
Confidential, Westlake, TX
Sr. Data Engineer/ Big Data Engineer
Responsibilities:
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Using AWS Redshift, extracted, transformed, and loaded data from various heterogeneous data sources and destinations.
- Performed data analysis and design; created and maintained large, complex logical and physical data models and metadata repositories using ERwin and MB MDR.
- Wrote shell scripts to trigger DataStage jobs.
- Assisted service developers in finding relevant content in the existing reference models.
- Worked with sources such as Access, Excel, CSV, Oracle, and flat files using the connectors, tasks, and transformations provided by AWS Data Pipeline.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed a PySpark script to mask raw data by applying hashing algorithms to client-specified columns (a minimal sketch follows this list).
- Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
- Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
- Created Tableau reports with complex calculations and worked on Ad-hoc reporting using PowerBI.
- Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
- Boosted the performance of regression models by applying polynomial transformation and feature selection, and used those methods to select stocks.
- Generated reports on predictive analytics using Python and Tableau, including visualizing model performance and prediction results.
- Primarily involved in Data Migration using SQL, SQL Azure, Azure Storage, and Azure Data Factory, SSIS, PowerShell.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, and NoSQL DB).
- Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF v1/v2).
- Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
- Designed, developed, and tested dimensional data models using star and snowflake schema methodologies under the Kimball method.
- Implement ad-hoc analysis solutions using Azure Data Lake Analytics/Store, HDInsight
- Developed a data pipeline using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Worked on DirectQuery in Power BI to compare legacy data with current data, and generated reports and dashboards.
- Designed SSIS Packages to extract, transfer, load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP)
- Used SQL Server Reporting Services (SSRS) to create and format cross-tab, conditional, drill-down, top-N, summary, form, OLAP, sub, ad-hoc, parameterized, interactive, and custom reports.
- Created action filters, parameters and calculated sets for preparing dashboards and worksheets using PowerBI
- Performed ETL testing activities such as running the jobs, extracting data from databases using the necessary queries, transforming it, and uploading it into the data warehouse servers.
- Created dashboards for analyzing POS data using Power BI
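The masking step mentioned above can be sketched as follows. This is a hypothetical, minimal PySpark example; the input/output paths and the list of sensitive columns are placeholders, and SHA-256 hashing via Spark's built-in sha2 function stands in for whichever hashing approach the project actually used.

```python
# Minimal sketch: hash client-specified columns of a raw dataset with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-client-columns").getOrCreate()

raw = spark.read.parquet("s3://bucket/raw/customers/")   # placeholder input path
sensitive_columns = ["ssn", "email", "phone"]             # placeholder client-specified columns

masked = raw
for column in sensitive_columns:
    # Replace each sensitive value with its SHA-256 digest so the raw value is not stored.
    masked = masked.withColumn(column, F.sha2(F.col(column).cast("string"), 256))

masked.write.mode("overwrite").parquet("s3://bucket/masked/customers/")  # placeholder output path
```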
Environment: AWS, Azure, SQL Server 2016, T-SQL, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), Databricks, SQL Server Analysis Services (SSAS), SQL Server Management Studio (SSMS), Advanced Excel (formulas, pivot tables, HLOOKUP, VLOOKUP, macros), Spark, Kafka, Impala, Python, Power BI, Tableau, Presto, Hive/Hadoop, Snowflake.
Confidential, Phoenix, AZ
Sr. Big Data Engineer
Responsibilities:
- Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap; installed, configured, and maintained data pipelines.
- Developed the features, scenarios, and step definitions for BDD (Behavior-Driven Development) and TDD (Test-Driven Development) using Cucumber, Gherkin, and Ruby.
- Designing the business requirement collection approach based on the project scope and SDLC methodology.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
- Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis; worked with data governance and data quality to design various models and processes.
- Involved in all the steps and scope of the project reference data approach to MDM, have created a Data Dictionary and Mapping from Sources to the Target in MDM Data Model.
- Experience managing Azure Data Lakes (ADLS) and Data Lake Analytics and an understanding of how to integrate with other Azure Services. Knowledge of USQL
- Responsible for working with various teams on a project to develop analytics-based solution to target customer subscribers specifically.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java for event-driven processing; created Lambda jobs and configured roles using the AWS CLI.
- Responsible for wide-ranging data ingestion using Sqoop and HDFS commands; accumulated partitioned data in various storage formats such as text, JSON, and Parquet; involved in loading data from the Linux file system into HDFS.
- Stored data files in Google Cloud Storage buckets on a daily basis; used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
- Writing UNIX shell scripts to automate the jobs and scheduling cron jobs for job automation using commands with Crontab.
- Implemented IoT streaming with Databricks Delta tables and Delta Lake to enable ACID transaction logging (see the streaming sketch after this list).
- Exposed transformed data in the Azure Databricks Spark platform as Parquet for efficient data storage.
- Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
- Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib.
- Ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
- Applied various machine learning algorithms and statistical modeling techniques (decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, clustering) to identify volume, using the scikit-learn package in Python, R, and MATLAB; collaborated with data engineers and software developers to develop experiments and deploy solutions to production.
- Worked on data that combined unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn.
- Experienced in working with various data sources such as Teradata and Oracle; successfully loaded files from Teradata into HDFS and loaded data from HDFS into Hive and Impala.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
- Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
- Involved in unit testing the code and provided feedback to the developers; performed unit testing of the application using NUnit.
- Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
- Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
- Write research reports describing the experiment conducted, results, and findings and make strategic recommendations to technology, product, and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter Notebook, Hive and NoSql.
- Wrote production level Machine Learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes in certain time frame.
- Performed all necessary day-to-day GIT support for different projects, Responsible for design and maintenance of the GIT Repositories, and the access control strategies.
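The IoT streaming bullet above can be illustrated with the following minimal Structured Streaming sketch on Databricks, where Delta Lake provides the ACID transaction log. The Kafka broker, topic, schema, and storage paths are placeholders, not values from the project.

```python
# Hypothetical sketch: stream IoT events from Kafka into a Delta Lake table (append-only, ACID-logged).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-delta-ingest").getOrCreate()

# Placeholder schema for the device payload.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "iot-events")                   # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

(
    events.writeStream
    .format("delta")                                       # Delta Lake adds the ACID transaction log
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/iot")  # placeholder checkpoint path
    .start("/mnt/delta/iot_events")                        # placeholder Delta table path
)
```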
Environment: Hadoop, Kafka, Spark, Sqoop, Spark SQL, Spark Streaming, Hive, Impala, Scala, Pig, NoSQL, Oozie, HBase, Data Lake, Python, Azure, Databricks, AWS (Glue, Lambda, Step Functions, SQS, CodeBuild, CodePipeline, EventBridge, Athena), Unix/Linux Shell Scripting, Informatica PowerCenter.
Confidential, Springfield, MA
Big Data Engineer
Responsibilities:
- Experience in job management using the Fair Scheduler; developed job processing scripts using Oozie workflows.
- Used Spark and Hive to implement the transformations needed to join daily ingested data with historic data.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time (a minimal sketch follows this list).
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Used the Spark API over an EMR cluster (Hadoop YARN) to perform analytics on data in Hive.
- Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
- Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and memory tuning.
- Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation.
- Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
- Developed logistic regression models (using R programming and Python) to predict subscription response rate based on customer’s variables like past transactions, response to prior mailings, promotions, demographics, interests and hobbies, etc.
- Created Tableau dashboards/reports for data visualization, Reporting and Analysis and presented it to Business.
- Created/ Managed Groups, Workbooks and Projects, Database Views, Data Sources and Data Connections
- Worked with the Business development managers and other team members on report requirements based on existing reports/dashboards, timelines, testing, and technical delivery.
- Knowledge of the Tableau administration tool for configuration, adding users, managing licenses and data connections, scheduling tasks, and embedding views by integrating with other platforms.
- Developed dimensions and fact tables for data marts like Monthly Summary, Inventory data marts with various Dimensions like Time, Services, Customers and policies.
- Developed reusable transformations to load data from flat files and other data sources to the Data Warehouse.
- Assisted the operations support team with transactional data loads by developing SQL*Loader and UNIX scripts.
- Implemented a Python script to call the Cassandra REST API, performed transformations, and loaded the data into Hive.
- Extensively worked in Python and built a custom ingestion framework.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, transformations, and other optimizations during the ingestion process itself.
- Experienced in writing live Real-time Processing using Spark Streaming with Kafka.
- Created Cassandra tables to store various formats of data coming from different sources.
- Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
- Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
- Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on the requirements.
- Monitored the daily, weekly, and monthly jobs and provided support in case of failures or issues.
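The Kafka-to-Spark Streaming bullet above can be sketched as follows. This is a hypothetical example assuming the Spark 1.x DStream API with the spark-streaming-kafka package (consistent with the Spark 1.6 environment listed below); the broker, topic, field names, and HDFS path are placeholders.

```python
# Minimal sketch: consume events from Kafka with Spark Streaming and land them on HDFS.
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="learner-model-stream")
ssc = StreamingContext(sc, batchDuration=30)  # 30-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["learner-events"],                            # placeholder topic
    kafkaParams={"metadata.broker.list": "broker:9092"},  # placeholder broker
)

# Each record is a (key, value) pair; parse the JSON payload and keep valid events.
events = (
    stream.map(lambda kv: json.loads(kv[1]))
          .filter(lambda e: e.get("learner_id") is not None)
)

# Persist each micro-batch as text files under a time-suffixed prefix (placeholder path).
events.map(json.dumps).saveAsTextFiles("hdfs:///data/learner/incremental/batch")

ssc.start()
ssc.awaitTermination()
```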
Environment: Hadoop YARN, Azure, Databricks, Spark 1.6, Spark Streaming, Spark SQL, Scala, Kafka, Python, Impala, Hive, Sqoop 1.4.6, Tableau, Talend, Oozie, Control-M, Java, AWS S3, Oracle 12c, Linux
Confidential, Troy, MI
Data Engineer
Responsibilities:
- Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
- Strong understanding of AWS components such as EC2 and S3
- Implemented a continuous delivery pipeline with Docker and GitHub.
- Worked with Google Cloud Functions in Python to load data into BigQuery for CSV files on arrival in a GCS bucket (a minimal sketch follows this list).
- Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
- Used Hive to implement a data warehouse and stored data in HDFS; stored data in Hadoop clusters set up on AWS EMR.
- Performed Data Preparation by using Pig Latin to get the right data format needed.
- Used PCA to reduce dimensionality and compute eigenvalues and eigenvectors, and used OpenCV to analyze CT scan images to detect disease.
- Processed the image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend
- Used Git for version control with Data Engineer team and Data Scientists colleagues.
- Developed and deployed data pipelines in clouds such as AWS and GCP.
- Performed data engineering functions: data extract, transformation, loading, and integration in support of enterprise data infrastructures - data warehouse, operational data stores and master data management
- Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
- Developed database management systems for easy access, storage, and retrieval of data.
- Performed DB activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java.
- Experience in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
- Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python.
- Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling; skilled in data visualization with libraries such as Matplotlib and Seaborn.
- Hands on experience with big data tools like Hadoop, Spark, Hive
- Experience implementing machine learning back-end pipeline with Pandas, NumPy
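The Cloud Functions bullet above can be illustrated with a minimal, hypothetical sketch of a GCS-triggered function that loads a newly arrived CSV into BigQuery; the destination table ID and trigger configuration are placeholders, not project values.

```python
# Minimal sketch of a background Cloud Function triggered on object finalize in a GCS bucket.
from google.cloud import bigquery


def load_csv_to_bq(event, context):
    """Load the CSV that just landed in GCS into a BigQuery table (placeholder table ID)."""
    client = bigquery.Client()
    uri = "gs://{}/{}".format(event["bucket"], event["name"])  # bucket/name come from the GCS event

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,            # assume a header row
        autodetect=True,                # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_uri(uri, "my-project.landing.events", job_config=job_config)
    load_job.result()  # wait for the load job to complete
```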
Environment: GCP, BigQuery, GCS buckets, Cloud Shell, Hadoop, Spark, Docker, Kubernetes, AWS, Apache Airflow, Python, Pandas, Matplotlib, Seaborn, NumPy, ETL workflows, Impala, Kafka, Scala.
Confidential
Hadoop/ Spark Developer
Responsibilities:
- Performed data transformations like filtering, sorting, and aggregation using Pig
- Created Sqoop jobs to import data from SQL, Oracle, and Teradata sources into HDFS.
- Created Hive tables to push the data to MongoDB.
- Wrote complex aggregation queries in MongoDB for report generation (a minimal sketch follows this list).
- Developed scripts to run scheduled batch cycles using Oozie and present data for reports
- Worked on a POC for building a movie recommendation engine based on Fandango ticket sales data using Scala and Spark Machine Learning library.
- Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformation, stored in efficient formats such as Parquet and loaded into Amazon S3 using the Spark Scala API.
- Implemented automation, traceability, and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop Streaming, Apache Spark, Spark SQL, Scala, Hive, and Pig.
- Designed highly efficient data model for optimizing large-scale queries utilizing Hive complex data types and Parquet file format.
- Performed data validation and transformation using Python and Hadoop streaming.
- Developed highly efficient Pig Java UDFs utilizing advanced concepts such as the Algebraic and Accumulator interfaces to populate ADP Benchmarks cube metrics.
- Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and into partitioned Hive tables.
- Developed Bash scripts to fetch TLOG files from the FTP server and process them for loading into Hive tables.
- Automated workflows using shell scripts and Control-M jobs to pull data from various databases into the Hadoop Data Lake.
- Extensively used the DB2 database to support SQL development.
- Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
- Insert-overwrote Hive data with HBase data daily to keep the data fresh, and used Sqoop to load data from DB2 into the HBase environment.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; good experience using Spark Shell and Spark Streaming.
- Designed, developed and maintained Big Data streaming and batch applications using Storm.
- Created Hive, Phoenix, HBase tables and HBase integrated Hive tables as per the design using ORC file format and Snappy compression.
- Developed Oozie workflows for daily incremental loads that pull data from Teradata and import it into Hive tables.
- Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Developed Pig scripts to transform the data into a structured format, automated through Oozie coordinators.
- Used Splunk to capture, index, and correlate real-time data in a searchable repository from which reports and alerts can be generated.
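The MongoDB reporting bullet above can be sketched with a short aggregation pipeline in Python. This is an illustrative example only: the connection string, database, collection, and field names are placeholders.

```python
# Minimal sketch: MongoDB aggregation pipeline for a daily revenue-by-region report.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
orders = client["reporting"]["orders"]              # placeholder database/collection

pipeline = [
    {"$match": {"status": "COMPLETED"}},                                 # keep finished orders only
    {"$group": {
        "_id": {
            "region": "$region",
            "day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$order_ts"}},
        },
        "revenue": {"$sum": "$amount"},
        "orders": {"$sum": 1},
    }},
    {"$sort": {"_id.day": 1, "revenue": -1}},                            # newest-to-oldest, top revenue first
]

for row in orders.aggregate(pipeline):
    print(row["_id"]["day"], row["_id"]["region"], row["revenue"], row["orders"])
```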
Environment: Hadoop, HDFS, Spark, Hive, Pig, Sqoop, Oozie, DB2, Java, Python, Oracle, SQL, Splunk, UNIX, Shell Scripting.