Data Engineer Resume
El Segundo, CA
SUMMARY
- 8+ years of overall IT experience across a variety of industries, including hands-on experience in Big Data analytics and development.
- Experience in collecting, processing, and aggregating large amounts of streaming data using Kafka, Spark Streaming.
- Good knowledge of Apache NiFi for automating and managing data flow between systems.
- Good understanding of data ingestion, Airflow operators for data orchestration, and related Python libraries.
- Experience in designing data marts following Star Schema and Snowflake Schema methodologies.
- Highly skilled in Business Intelligence tools such as Tableau, Power BI, Plotly, and Dataiku.
- Experience in managing and analyzing massive datasets on multiple Hadoop distributions such as Cloudera and Hortonworks.
- Experience in designing and developing applications in Spark using Python to compare the performance of Spark with Hive.
- Hands-on experience in Service-Oriented Architecture (SOA), Event-Driven Architecture, Distributed Application Architecture, and Software as a Service (SaaS).
- In-depth understanding of Snowflake cloud technology.
- Experience in Spark-Scala programming with good knowledge on Spark Architecture and its In-memory Processing.
- Experience with Snowflake Multi-Cluster Warehouses.
- Experience with Snowflake Virtual Warehouses.
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services in the AWS family.
- Good working experience with cutting-edge technologies such as Kafka, Spark, and Spark Streaming.
- Partnered with cross-functional teams across the organization to gather requirements, architect, and develop proofs of concept for enterprise data lake environments such as MapR, Cloudera, Hortonworks, AWS, and Azure.
- Strong experience in analyzing data using Hive, Impala, Pig Latin, and Drill; experience writing custom UDFs in Hive and Pig to extend functionality.
- Experience in writing MapReduce programs in Java for data cleansing and preprocessing.
- Excellent understanding of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, ResourceManager, and NodeManager.
- Hands-on experience with AWS data analytics services such as Athena, Glue Data Catalog, and QuickSight.
- Good working experience with Hive and HBase/MapR-DB integration.
- Excellent understanding and knowledge of NoSQL databases such as HBase and Cassandra.
- Experienced in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Python (see the sketch at the end of this summary).
- Experience setting up instances behind an Elastic Load Balancer in AWS for high availability, and cloud integration with AWS Elastic MapReduce (EMR).
- Experience working with the Hadoop ecosystem integrated with the AWS cloud platform, using services such as Amazon EC2 instances, S3 buckets, and Redshift.
- Good experience working with Azure Cloud Platform services like Azure Data Factory (ADF), Azure Data Lake, Azure Blob Storage, Azure SQL Analytics, HDInsight/Databricks.
- Exposed to various software development methodologies such as Agile and Waterfall.
- Extensive experience working with the Spark distributed framework, including Resilient Distributed Datasets (RDDs) and DataFrames, using Python, Scala, and Java 8.
- Involved in developing applications on Windows, UNIX, and Linux platforms.
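A minimal, illustrative PySpark sketch of the Hive-to-Spark conversion noted above; the table and column names (sales, region, amount, sale_date) are hypothetical placeholders, not taken from any actual engagement.

```python
# Hypothetical example: the same Hive aggregation expressed two ways in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("hive-to-spark-sketch")
    .enableHiveSupport()          # assumes a Hive metastore is configured
    .getOrCreate()
)

# Original Hive/SQL query, run as-is through Spark SQL:
hive_style = spark.sql(
    "SELECT region, SUM(amount) AS total_sales "
    "FROM sales WHERE sale_date >= '2020-01-01' GROUP BY region"
)

# Equivalent logic rewritten as DataFrame transformations:
df_style = (
    spark.table("sales")
    .filter(F.col("sale_date") >= "2020-01-01")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_sales"))
)
```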
TECHNICAL SKILLS
BigData/Hadoop Technologies: MapReduce, Spark, Cloudera, Spark SQL, Azure HDInsight, Impala, AWS, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, Flink, YARN, Oozie, ZooKeeper, Hue, Ambari Server, Databricks, EMR
Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/R Studio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB, Oracle, MySQL, SQL Server
Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML, JDBC, Struts
Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.
Public Cloud: EC2, IAM, S3, Autoscaling, CloudWatch, Route53, EMR, RedShift, Glue, Athena, SageMaker.
Orchestration tools: Oozie, Airflow.
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos
Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza
Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris
Currently Learning: AWS Lambda, pandas, CI/CD with Jenkins
PROFESSIONAL EXPERIENCE
Confidential, El Segundo, CA
Data Engineer
Responsibilities:
- Create and maintain reporting infrastructure to facilitate visual representation of manufacturing data for purposes of operations planning and execution.
- Extract, Transform and Load data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and Azure Data Lake Analytics.
- Implemented Restful web service to interact with Redis Cache framework.
- Data intake is handled through Sqoop, and ingestion through MapReduce and HBase.
- Extensively worked on Spark Streaming and Apache Kafka to fetch live stream data (an illustrative sketch follows this list).
- Responsible for applying machine-learning techniques (regression/classification) to predict outcomes.
- Constructed product-usage SDK data and data aggregations using PySpark, Scala, Spark SQL, and Hive context, stored in partitioned Hive external tables in an AWS S3 location for reporting, data science dashboarding, and ad hoc analyses.
- Involved in data processing using an ETL pipeline orchestrated by AWS Data Pipeline using Hive.
- Installed Kafka Manager to monitor consumer lag and Kafka metrics; also used it for adding topics, partitions, etc.
- Experience in creating configuration files to deploy the SSIS packages across all environments.
- Experience in writing queries in dd3 and R to extract, transform and load (ETL) data from large datasets using Data Staging.
- Implemented CI/CD pipelines using Jenkins and built and deployed the applications.
- Developed RESTful endpoints to cache application-specific data in in-memory data clusters like Redis.
- Created Databricks notebooks using SQL and Python, and automated them using jobs.
- Interacted with other data scientists and architected custom data visualization solutions using tools like Tableau and packages in R.
- Developed predictive models using Python and R to predict customer churn and classify customers.
- Played key role in Migrating Teradata objects into Snowflake environment.
- Coordinated with QA team in preparing for compatibility testing of Guidewire solution.
- Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modelling and data mining, machine learning and advanced data processing.
- Heavily involved in testing Snowflake to understand the best possible way to use cloud resources.
- Experience using Airflow operators for data orchestration.
- Designed and implemented topic configurations in the new Kafka cluster across all environments.
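A minimal, illustrative sketch of the Kafka-to-Spark Structured Streaming pattern referenced above; the broker address, topic name, and console sink are placeholder assumptions, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
# Hypothetical example: consuming a live Kafka topic with Spark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Subscribe to a placeholder topic on a placeholder broker.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka values arrive as bytes; cast to string before downstream parsing.
events = raw.select(F.col("value").cast("string").alias("payload"))

# Console sink used purely for illustration; a production job would write to S3/Hive.
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```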
Environment: Hadoop, ETL operations, Data Warehousing, Data Modelling, Cassandra, AWS Cloud computing architecture, EC2, S3, Advanced SQL methods, NiFi, Python, Linux, Apache Spark, Scala, Spark-SQL, HBase
Confidential, Plano, TX
Data Engineer
Responsibilities:
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.
- Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
- Wrote multiple Hive UDFs using core Java and OOP concepts, and Spark functions within Python programs.
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed formats.
- Managed a hosted Kubernetes environment, making it quick and easy to deploy and manage containerized applications without container orchestration expertise.
- Undertook data analysis and collaborated with the downstream analytics team to shape the data to their requirements.
- Used Azure Event Grid, a managed event service, to easily route events across many different Azure services and applications.
- Used Azure Service Bus to decouple applications and services from each other, providing benefits such as load-balancing work across competing workers.
- Used Delta Lake for scalable metadata handling and unified streaming and batch processing.
- Used Delta Lake time travel, as data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
- Leveraged Delta Lake merge, update, and delete operations to enable complex use cases (an illustrative sketch follows this list).
- Used Azure Databricks for fast, easy, and collaborative spark-based platform on Azure.
- Used Databricks to integrate easily with the whole Microsoft stack.
- Wrote Spark SQL and PySpark scripts in the Databricks environment to validate monthly account-level customer data.
- Created Spark clusters and configured high-concurrency clusters using Azure Databricks (ADB) to speed up the preparation of high-quality data.
- Spun up HDInsight clusters and used Hadoop ecosystem tools such as Kafka, Spark, and Databricks for real-time streaming analytics, and Sqoop, Pig, Hive, and Cosmos DB for batch jobs.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Used Azure Data Catalog to organize data assets and get more value from existing investments.
- Used Azure Synapse to bring these worlds together with a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.
- Utilized clinical data to generate features describing different illnesses using LDA topic modelling.
- Used PCA for dimensionality reduction, computing eigenvalues and eigenvectors, and used OpenCV to analyze CT scan images to identify disease.
- Processed the image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend.
- Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
- Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
- Utilized Waterfall methodology for team and project management.
- Used Git for version control with Data Engineer team and Data Scientists colleagues.
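A minimal, illustrative Delta Lake merge (upsert) sketch in PySpark, corresponding to the merge/update/delete bullet above; the table paths and the account_id join key are hypothetical placeholders.

```python
# Hypothetical example: upserting a staging dataset into a Delta table with MERGE.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-merge-sketch").getOrCreate()

# Placeholder paths; in practice these would point at mounted ADLS/Blob storage.
target = DeltaTable.forPath(spark, "/mnt/delta/accounts")
updates = spark.read.format("parquet").load("/mnt/staging/accounts")

(
    target.alias("t")
    .merge(updates.alias("u"), "t.account_id = u.account_id")
    .whenMatchedUpdateAll()      # update rows that already exist in the target
    .whenNotMatchedInsertAll()   # insert rows that are new
    .execute()
)
```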
Environment: Ubuntu 16.04, Hadoop 2.0, Spark (PySpark, Spark Streaming, Spark SQL, Spark MLlib), NiFi, Jenkins, Pig 0.15, Python 3.x (NLTK, Pandas), Tableau 10.3, GitHub, Azure (Storage, DW, ADF, ADLS, Databricks), AWS Redshift, and OpenCV.
Confidential, Rochester, MN
Data Engineer
Responsibilities:
- Migrate the existing data from Teradata/SQL Server to Hadoop and perform ETL operations on it.
- Responsible for loading structured, unstructured, and semi-structured data into Hadoop by creating static and dynamic partitions.
- Worked on different data formats such as JSON and performed machine learning algorithms in Python.
- Performed statistical data analysis and data visualization using Python and R.
- Implemented data ingestion and handling clusters in real time processing using Kafka.
- Imported real time weblogs using Kafka as a messaging system and ingested the data to Spark Streaming.
- Created a task scheduling application to run in an EC2 environment on multiple servers.
- Strong knowledge of various data warehousing methodologies and snowflake data modeling concepts.
- Worked on Airflow 1.8 (Python 2) and Airflow 1.9 (Python 3) for orchestration; familiar with building custom Airflow operators and orchestrating workflows with dependencies spanning multiple clouds.
- Scheduled different Snowflake jobs using NiFi.
- Used NiFi to ping snowflake to keep Client Session alive.
- Developed Hadoop Streaming MapReduce jobs using Python.
- Created Hive partitioned tables in Parquet and Avro formats to improve query performance and space utilization (an illustrative sketch follows this list).
- Responsibilities include Database Design and Creation of User Database.
- Moving ETL pipelines from SQL server to Hadoop Environment.
- Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
- Implemented a CI/CD pipeline using Jenkins, Airflow for Containers from Docker and Kubernetes.
- Used SSIS, NIFI, Python scripts, Spark Applications for ETL Operations to create data flow pipelines and involved in transforming data from legacy tables to Hive, HBase tables, and S3 buckets for handoff to business and Data scientists to create analytics over the data.
- Support current and new services that leverage AWS cloud computing architecture including EC2, S3, and other managed service offerings.
- Implemented data quality checks using Spark Streaming and flagged records as bad or passable.
- Used advanced SQL methods to code, test, debug, and document complex database queries.
- Design relational database models for small and large applications.
- Designed and developed Scala workflows for data pull from cloud-based systems and applying transformations on it.
- Developed reliable, maintainable, and efficient code in SQL, Linux shell, and Python.
- Implemented Apache Spark code to read multiple tables from real-time records and filter the data based on requirements.
- Stored final computation results in Cassandra tables and used Spark SQL and Spark Datasets to perform data computations.
- Used Spark for data analysis and store final computation results to HBase tables.
- Troubleshoot and resolve complex production issues while providing data analysis and data validation.
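A minimal, illustrative sketch of creating and loading a partitioned, Parquet-backed Hive table from PySpark, matching the partitioned-table bullet above; the database, table, and column names are hypothetical placeholders.

```python
# Hypothetical example: a date-partitioned Parquet Hive table loaded with dynamic partitioning.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-parquet-partition-sketch")
    .enableHiveSupport()          # assumes a Hive metastore is configured
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.web_events (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
""")

# Dynamic-partition insert: each distinct event_date lands in its own partition,
# which lets queries prune partitions instead of scanning the full table.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT INTO TABLE analytics.web_events PARTITION (event_date)
    SELECT user_id, url, ts, CAST(ts AS DATE) AS event_date
    FROM staging.raw_web_events
""")
```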
Environment: Teradata, SQL Server, Hadoop, ETL operations, Data Warehousing, Data Modelling, Cassandra, AWS Cloud computing architecture, EC2, S3, Advanced SQL methods, NiFi, Python, Linux, Apache Spark, Scala, Spark-SQL, HBase
Confidential, Dallas, TX
Data Analyst
Responsibilities:
- Involved in designing/developing logical and physical data models using Erwin DM.
- Worked with DB2 Enterprise, Oracle Enterprise, Teradata 13, mainframe sources, Netezza, flat files, and operational datasets as sources.
- Worked with various process improvements, normalization, de-normalization, data extraction, data cleansing, and data manipulation.
- Performed data management projects and fulfilled ad hoc requests according to user specifications using data management tools such as TOAD, MS Access, Excel, XLS, and SQL Server.
- Worked with requirements management, workflow analysis, source data analysis, data mapping, Metadata management, data quality, testing strategy and maintenance of the model.
- Used DVO to validate the data moving from Source to Target.
- Created requests in Answers and viewed the results in various views such as title view, table view, compound layout, chart, pivot table, ticker, and static view.
- Assisted with production OLAP cubes and wrote queries to produce reports using SQL Server Analysis Services (SSAS) and Reporting Services (SSRS); edited, upgraded, and maintained an ASP.NET website and IIS server.
- Used SQL Profiler for troubleshooting, monitoring, and optimization of SQL Server and non-production database code as well as T-SQL code from developers and QA.
- Involved in extracting data from various sources such as Oracle Database, XML, flat files, and CSV files, and loading it into the target warehouse.
- Created complex mappings in Informatica PowerCenter Designer using Aggregator, Expression, Filter, and Sequence Generator transformations.
- Designed the ER diagrams, logical model (relationship, cardinality, attributes, and candidate keys) and physical database (capacity planning, object creation and aggregation strategies) for Oracle and Teradata as per business requirements using Erwin.
- Designed Power View and Power Pivot reports and designed and developed the Reports using SSRS.
- Designed and created MDX queries to retrieve data from cubes using SSIS.
- Created SSIS Packages using SSIS Designer for exporting heterogeneous data from OLE DB Source, Excel Spreadsheets to SQL Server.
- Extensively worked in SQL, PL/SQL, SQL Plus, SQL Loader, Query performance tuning, DDL scripts, database objects like Tables, Views Indexes, Synonyms and Sequences.
- Developed and supported the extract, transform, and load (ETL) process for a data warehouse.
Environment: Erwin 9.1, Netezza, Oracle 8.x, SQL, PL/SQL, SQL Plus, SQL Loader, Informatica, CSV, Teradata 13, T-SQL, SQL Server, SharePoint, Pivot Tables, Power View, DB2, SSIS, DVO, Linux, MDM, ETL, Excel, SAS, SSAS, SPSS, SSRS.
Confidential, Atlanta, Georgia
Data Analyst
Responsibilities:
- Collaborated with Business Analysts, SMEs across departments to gather business requirements, and identify workable items for further development.
- Partnered with ETL developers to ensure the data was well cleaned and the data warehouse was kept up to date for reporting purposes, using Pig.
- Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, then structured and stored the data in AWS Redshift.
- Deployed services on AWS and utilized Step Functions to trigger the data pipelines.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Created plugins to extract data from multiple sources like Apache Kafka, Database and Messaging Queues.
- Ran Log aggregations, website Activity tracking and commit log for distributed system using Apache Kafka.
- Developed parser and loader MapReduce applications to retrieve data from HDFS and store it in HBase and Hive.
- Experienced in setting up Multi-hop, Fan-in, and Fan-out workflow in Flume.
- Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
- Implemented custom serializers and interceptors in Flume to mask confidential data and filter unwanted records from the event payload.
- Configured, designed, implemented, and monitored Kafka cluster and connectors.
- Responsible for ingesting large volumes of IoT data into Kafka.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics (an illustrative sketch follows this list).
- Worked with teams to use KSQL for real-time analytics.
- Worked with multiplexing, replicating and consolidation in Flume.
- Used Oozie operational services for batch processing and dynamic workflow scheduling.
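A minimal, illustrative sketch of the REST-to-Kafka producer pattern mentioned above, using the kafka-python client; the API URL, topic name, broker address, and polling interval are hypothetical placeholders.

```python
# Hypothetical example: poll an external REST API and produce each record to a Kafka topic.
import json
import time

import requests
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                        # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # serialize dicts as JSON bytes
)

while True:
    resp = requests.get("https://api.example.com/iot/readings")  # placeholder endpoint
    resp.raise_for_status()
    for record in resp.json():
        producer.send("iot-readings", value=record)               # placeholder topic
    producer.flush()
    time.sleep(30)  # poll interval; tune to the source API's rate limits
```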
Environment: Spark (PySpark, Spark SQL, Spark Streaming, Spark MLlib), Kafka, Python 3.x (Scikit-learn, NumPy, Pandas), Tableau 10.1, GitHub, AWS EMR/EC2/S3/Redshift/Glue, and Pig.