Sr. Data Engineer Resume
Sacramento, CA
PROFESSIONAL SUMMARY:
- IT professional with 8+ years of experience specializing in the Hadoop/Big Data ecosystem, data acquisition, ingestion, modeling, storage, analysis, integration, data processing, cloud engineering, and data warehousing.
- Proficient knowledge of Data Analytics, Machine Learning (ML), Predictive Modeling, Natural Language Processing (NLP), and Deep Learning algorithms.
- A Data Science enthusiast with strong problem-solving, debugging, and analytical capabilities, who actively engages in understanding and delivering business requirements.
- Hands-on experience with Hadoop ecosystem components such as HDFS, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, ZooKeeper, Flume, Kafka, NiFi, the MapReduce framework, YARN, and Scala.
- In-depth understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, MapReduce, and Spark.
- Experienced in using various Python libraries like NumPy, SciPy, python-twitter, Pandas.
- Worked on visualization tools like Tableau for report creation and further analysis.
- Proficient at writing MapReduce jobs and UDFs to analyze, transform, and deliver data as per requirements.
- Good knowledge of Spark architecture and components; efficient in working with Hadoop Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
- Experience configuring Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS (see the sketch at the end of this list), and expertise in using Spark with data sources such as JSON, Parquet, and Hive.
- Expertise in building PySpark, Java, and Scala applications for interactive analysis, batch processing, and stream processing.
- Experience building data pipelines, backend microservices, and REST APIs using Python, Java, and comparable languages.
- Profound knowledge of developing production-ready Spark applications utilizing Spark Core, Spark Streaming, Spark SQL, DataFrames, Datasets, and Spark ML.
- Hands-on full-lifecycle implementation experience with the CDH (Cloudera) and HDP (Hortonworks) distributions.
- Experience in using Flume to load log files into HDFS and Oozie for workflow design and scheduling.
- Experience developing high throughput streaming applications from Kafka queues and writing enriched data back to outbound Kafka queues.
- Working knowledge of the Kimball methodology and good working knowledge of Yellowbrick.
- Analytical approach to problem-solving; able to use technology to solve business problems using Azure Data Factory, Azure Data Lake, and Azure Synapse.
- Strong working experience with SQL and NoSQL databases, data modeling and data pipelines. Involved in end-to-end development and automation of ETL pipelines using SQL and Python.
- Proficient in NoSQL databases including HBase, Cassandra, and MongoDB, and their integration with Hadoop clusters.
- Extensive experience in the implementation of Continuous Integration (CI), Continuous Delivery, and Continuous Deployment (CD) on various Java-based applications using Jenkins, TeamCity, Azure DevOps, Maven, Git, Nexus, Docker, and Kubernetes.
- Worked on Airflow 1.8 (Python 2) and Airflow 1.9 (Python 3) for orchestration; familiar with building custom Airflow operators and orchestrating workflows with dependencies spanning multiple clouds.
- Orchestration experience using Azure Data Factory, Airflow 1.8, and Airflow 1.10 on multiple cloud platforms, with a solid understanding of how to leverage Airflow operators.
- Expertise in creating Kubernetes clusters with CloudFormation templates and PowerShell scripting to automate deployments in cloud environments.
- Excellent knowledge of Amazon Web Services (AWS) concepts such as the EMR and EC2 web services, which provide fast and efficient processing for Teradata big data analytics.
- Working knowledge of Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, Cosmos DB).
- Involved in migration of the legacy applications to cloud platform using DevOps tools like GitHub, Jenkins, JIRA, Docker, and Slack.
- Experience in designing interactive dashboards and reports, and performing ad-hoc analysis and visualizations, using Tableau, Power BI, Arcadia, and Matplotlib.
- Ingested data into the Snowflake cloud data warehouse using Snowpipe. Extensive experience with micro-batching to ingest millions of files into Snowflake as they arrive in the staging area.
- Good communication and strong interpersonal and organizational skills with the ability to manage multiple projects. Always willing to learn and adopt new technologies.
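Illustrative sketch of the Kafka-to-HDFS streaming pattern noted above: a minimal PySpark Structured Streaming job that reads JSON events from Kafka and writes Parquet to HDFS. The broker address, topic, schema, and paths are hypothetical placeholders, not values from any specific engagement.

```python
# Minimal PySpark Structured Streaming sketch: consume JSON events from Kafka
# and persist them to HDFS as Parquet. Requires the spark-sql-kafka-0-10
# package on the classpath; broker, topic, schema, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
       .option("subscribe", "events")                        # placeholder topic
       .option("startingOffsets", "latest")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")              # placeholder output path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```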
TECHNICAL SKILLS:
Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Kafka, Impala, StreamSets, Oozie, Spark, ZooKeeper, NiFi, Airflow.
Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP
Languages: Python, Scala, Java, R, Pig Latin, HiveQL, Shell Scripting.
Software Methodologies: Agile, SDLC Waterfall.
IDEs & Dev Tools: Eclipse, NetBeans, PySpark, IntelliJ, Spring Tool Suite, Jenkins, Kubernetes, Docker, REST API.
Databases: MySQL, Oracle, DB2, PostgreSQL, DynamoDB, SQL Server, Teradata.
NoSQL: HBase, MongoDB, Cassandra.
ETL/BI: Power BI, Tableau, Talend, Snowflake, Informatica, DAX, SSIS, SSRS, SSAS, QlikView, Qlik Sense.
Version Control: Git, SVN, Bitbucket.
Web Development: JavaScript, Node.js, HTML, CSS, Spring, J2EE, JDBC, Angular, Hibernate, Tomcat.
Operating Systems: Windows (XP/7/8/10), Linux (Unix, Ubuntu), Mac OS.
Cloud Technologies: Amazon Web Services (EC2, S3, Redshift), Microsoft Azure, Databricks, DAX, ADF, Yellowbrick, GCP.
PROFESSIONAL EXPERIENCE
Confidential, Sacramento, CA
Sr. Data Engineer
Responsibilities:
- Developed Impala queries for faster querying and performed data transformations on Hive tables.
- Created Hive tables and wrote Hive queries for data analysis to meet business requirements; used Sqoop to import and export data from Oracle and MySQL.
- Implemented Spark to migrate MapReduce jobs into Spark RDD transformations and Spark Streaming (a sketch of this pattern appears at the end of this list).
- Worked on large sets of structured, semi-structured, and unstructured data.
- Worked with Sqoop for importing data from relational databases.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python to develop various machine learning algorithms, and utilized ML algorithms such as linear regression, multivariate regression, and naive Bayes for data analysis.
- Experience with the Apache Spark Streaming and batch frameworks; created Spark jobs for data transformation and aggregation.
- Installed Kafka on the Hadoop cluster and configured producers and consumers in Java to establish a connection from the source to HDFS for data on popular hashtags.
- Configured Spark Streaming to receive real-time data from Apache Flume and store the stream data, using Scala, to Azure Table storage; Azure Data Lake was used for storage and all types of processing and analytics.
- Designed and architected scalable data processing and analytics solutions, including technical feasibility, integration, and development for big data storage, processing, and consumption across Azure data analytics, big data (Hadoop, Spark), business intelligence (Reporting Services, Power BI), NoSQL, HDInsight, Stream Analytics, Data Factory, Event Hubs, and Notification Hubs.
- Built numerous technology demonstrators on the Confidential Edison Arduino shield with Azure Event Hubs and Stream Analytics, integrated with Power BI and Azure ML, to demonstrate the capabilities of Azure Stream Analytics.
- Worked on Google Cloud Platform (GCP) services such as Cloud Storage, Cloud SQL, and Stackdriver monitoring.
- Created the enterprise data warehouse Databricks process via SSIS packages from scratch, along with the data warehouse ETL job staging, dimension and fact processing, and data mart population processes.
- Developed Kafka producer and consumers, HBase clients, Spark, and Hadoop MapReduce jobs along with components on HDFS, Hive.
- Wrote unit tests and worked with the DevOps team on installing libraries and Jenkins agents, and productionized ETL jobs and microservices.
- Developed multiple MapReduce jobs in Java to clean datasets; extensively used the Java Collections API (Lists, Sets, and Maps).
- Automated a build setup for Jenkins to create images from Bitbucket and deploy with CI/CD setup.
- Worked on Azure Data Factory to integrate both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) data and applied transformations before loading back to Azure Synapse.
- Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Experience designing Azure cloud architecture and implementation plans for hosting complex application workloads on MS Azure; involved in writing Spark Scala functions for mining data to provide real-time insights and reports.
- Experienced in creating data pipelines integrating Kafka with Spark Streaming applications, using Scala for writing the applications.
- Developed ADF pipelines to load data from on-prem sources to Azure cloud storage and databases.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and SQL with Databricks.
- Worked on syncing Oracle RDBMS to the Hadoop DB (HBase) while retaining Oracle as the main data store.
- Created a POC to assess the feasibility of the requirements through data discovery and profiling; after sign-off, planned the essential queries to analyze the existing data.
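A minimal PySpark sketch of the MapReduce-to-Spark-RDD migration pattern referenced above; a classic word count stands in for the actual jobs, and the HDFS paths are hypothetical placeholders.

```python
# Minimal PySpark sketch: a MapReduce-style job expressed as Spark RDD
# transformations. Input/output paths are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-to-rdd").getOrCreate()
sc = spark.sparkContext

# Equivalent of a classic MapReduce word count: map -> shuffle -> reduce
lines = sc.textFile("hdfs:///data/input/*.txt")              # placeholder input path
counts = (lines.flatMap(lambda line: line.split())            # map phase
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))               # reduce phase

counts.saveAsTextFile("hdfs:///data/output/word_counts")      # placeholder output path
spark.stop()
```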
Environment: Hive, Sqoop, Java, MySQL, Oracle, Spark, MapReduce, Pandas, NumPy, SciPy, Matplotlib, Scikit-learn, NLTK, Apache Flume, NoSQL, HDFS, HBase, Kafka, Azure, Jenkins, JSON, BSON, PySpark, ETL, AKS, RDBMS
Confidential, TX
Data engineer
Responsibilities:
- Provided the architectural leadership in shaping strategic, business technology projects, with an emphasis on application architecture.
- Implemented simple-to-complex transformations on streaming data and datasets. Worked on analyzing the Hadoop cluster and different big data analytics tools, including Hive, Spark, Python, Sqoop, Flume, and Oozie.
- Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
- Installed and configured Apache Hadoop big data components such as HDFS, MapReduce, YARN, Hive, HBase, Sqoop, Pig, and Ambari.
- Migrated from JMS (Solace) to Apache Kafka; used ZooKeeper to manage synchronization, serialization, and coordination across the cluster.
- Gathered functional requirements & developed QlikView dashboards.
- Built QVDs from different data sources such as SQL Server, Oracle, Excel, CSV, and flat files.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Created complex SQL queries and scripts to extract, aggregate, and validate data from MS SQL, Oracle, and flat files using Informatica, and loaded it into a single data warehouse repository.
- Migrated the application onto the AWS cloud.
- Created MapReduce jobs running over HDFS for data mining and analysis using R, and loaded and stored data with Pig scripts and R for MapReduce operations.
- Designed and developed Qlik Sense reports on the firm's Platinum Clients' investment behavior, growth opportunities, and areas of interest.
- Designed KPI and governance reporting utilizing Qlik Sense dashboards and sheet objects (multi-boxes, text objects, scatter charts, bar charts, and containers).
- Covered the SDLC from analysis to production implementation, with emphasis on identifying sources and validating source data, implementing business logic, and using transformations as required when developing mappings and loading the data into the target.
- Good understanding of data ingestion, Airflow operators for data orchestration, and other related Python libraries.
- Created Airflow scheduling scripts (DAGs) in Python; see the sketch at the end of this list.
- Worked on Snowflake modeling; highly proficient in data warehousing techniques for data cleansing, slowly changing dimensions, surrogate key assignment, and change data capture.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Designed the dimensional model, data lake architecture, and Data Vault 2.0 on Snowflake, and used the Snowflake logical data warehouse for compute.
- Used Spark Streaming to receive real-time data from Kafka and store the stream data, using Scala, to HDFS and to NoSQL databases such as HBase and Cassandra.
- Scheduled different Snowflake jobs using NiFi and Migrated the data from Redshift data warehouse to Snowflake.
- Designed, implemented, and maintained database schemas, entity-relationship diagrams, data models, tables, stored procedures, functions and triggers, constraints, clustered and non-clustered indexes, partitioned tables, views, rules, defaults, and complex SQL statements to meet business requirements and enhance performance.
- Consumed Extensible Markup Language (XML) messages using Kafka and processed the xml file using Spark Streaming to capture User Interface (UI) updates.
- Designed multiple Python packages that were used within a large ETL process used to load 2TB of data from an existing Oracle database into a new PostgreSQL cluster.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop.
- Leveraged ETL methods for ETL solutions and data warehouse tools for reporting and analysis.
- Used CSVExcelStorage to parse files with different delimiters in Pig.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
- Worked with various Teradata 15 tools such as Teradata Viewpoint, MultiLoad, BTEQ, Teradata Administrator, and the Teradata utilities.
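A minimal sketch of an Airflow scheduling script (Airflow 1.x style) of the kind referenced above; the DAG id, schedule, and task callables are hypothetical placeholders rather than the production workflow.

```python
# Minimal Airflow 1.x DAG sketch: a daily extract-and-load schedule in Python.
# DAG id, schedule, and callables are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

def extract():
    # placeholder: pull data from the source system
    pass

def load_to_snowflake():
    # placeholder: load the extracted data into Snowflake
    pass

with DAG(
    dag_id="daily_snowflake_load",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)

    extract_task >> load_task
```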
Environment: Python, SQL, Pig, R, Informatica, regression, Flume, Oozie, Scala, Spark, Kafka, MongoDB, Hadoop, Hive, Teradata 15, ZooKeeper, XML, Snowflake, HDFS, ETL, QlikView, Qlik Sense, Airflow, MapReduce, CSV, Excel, JSON, AWS.
Confidential
Data Engineer
Responsibilities:
- Wrote scripts and indexing strategy for a migration to Confidential Redshift from SQL Server and MySQL databases.
- Implement software enhancements to port legacy software systems to Spark and Hadoop ecosystems on Azure Cloud.
- Used Pig as ETL tool to do transformations, event joins, filters, and some pre-aggregations before storing the data onto HDFS.
- Involved in relational and dimensional data modeling to create logical and physical database designs and ER diagrams with all related entities and the relationships between them, based on the rules provided by the business manager, using ER Studio.
- Analyzed existing systems and proposed improvements in processes and systems for the usage of modern scheduling tools like Databricks, and migrated the legacy systems into an enterprise data lake built on Azure Cloud.
- Worked on Snowflake modeling; highly proficient in data warehousing techniques for data cleansing, slowly changing dimensions, surrogate key assignment, and change data capture.
- Optimized Pig scripts and performed user interface analysis, performance tuning, and analysis.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Implemented the import and export of data using XML and SSIS.
- Involved in planning, defining, and designing the database using ER Studio based on business requirements, and provided documentation.
- Responsible for migrating applications running on-premises onto the Azure cloud.
- Used SSIS to build automated multi-dimensional cubes.
- Developed Sqoop jobs for data ingestion and incremental data loads from RDBMS to Snowflake.
- Wrote indexing and data distribution strategies optimized for sub-second query response.
- Developed a statistical model using artificial neural networks for ranking the students to better assist the admission process.
- Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
- Developed custom Airflow operators using Python to generate and load CSV files into GS from SQL Server and Oracle databases (an operator sketch appears at the end of this list).
- Designed Data Marts by following Star Schema and Snowflake Schema Methodology, using industry leading Data modeling tools like ER Studio.
- Managed local deployments in Kubernetes, creating local cluster and deploying application containers.
- Built and maintained Docker container clusters managed by Kubernetes, using Linux, Bash, Git, and Docker.
- Utilized Kubernetes and Docker as the runtime environment of the CI/CD system to build, test, and deploy.
- Developed and implemented Kubernetes manifests and Helm charts for deployment of microservices into k8s clusters.
- Prepared and uploaded SSRS reports; managed database and SSRS permissions.
- Built analytical dashboards to track the student records and GPAs across the board.
- Used deep learning frameworks like MXNet, Caffe 2, TensorFlow, Theano, CNTK and Keras to help clients build Deep learning models.
- Participated in requirements meetings and data mapping sessions to understand business needs.
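A minimal sketch of a custom Airflow operator (Airflow 1.x style) of the kind referenced above, dumping a SQL query to a local CSV file; the class name, connection id, and path are hypothetical, and the real operators handled generation and loading into GS.

```python
# Minimal custom Airflow operator sketch (Airflow 1.x): run a SQL query against
# a configured connection and write the rows to CSV. Names and paths are
# illustrative placeholders, not the actual production operator.
import csv

from airflow.models import BaseOperator
from airflow.hooks.base_hook import BaseHook
from airflow.utils.decorators import apply_defaults


class SqlToCsvOperator(BaseOperator):
    """Run a SQL query and write the result set to a CSV file."""

    @apply_defaults
    def __init__(self, conn_id, sql, output_path, *args, **kwargs):
        super(SqlToCsvOperator, self).__init__(*args, **kwargs)
        self.conn_id = conn_id
        self.sql = sql
        self.output_path = output_path

    def execute(self, context):
        # Assumes the connection resolves to a DB-API style hook
        # (e.g. an MSSQL or Oracle hook) that provides get_records.
        hook = BaseHook.get_hook(self.conn_id)
        rows = hook.get_records(self.sql)
        with open(self.output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(rows)
        self.log.info("Wrote %d rows to %s", len(rows), self.output_path)
```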
Environment: ER Studio, AWS, OLTP, Teradata, Sqoop, MongoDB, MySQL, HDFS, Linux, shell scripts, SSIS, SSAS, HBase, Azure, MDM, Airflow.
Confidential
Big data Developer
Responsibilities:
- Extensively worked on building the data items for CMS applications by developing PL/SQL procedures and packages for each Episode of Care (EOC) module.
- Worked closely with business analysts and architects to gather requirements and implement the design work to produce the final data sets.
- Worked on critical change requests and quarterly EOC releases to provide data to end users for data mining and business intelligence reporting.
- Created external tables to load data from external files into Oracle source tables.
- Created complex PySpark packages, utilizing SSIS to extract data from staging tables with an incremental load.
- Worked on SSIS packages and DTS Import/Export for moving data from databases (Oracle and text-format data) to SQL Server.
- Created Spark jobs utilizing the Spark SQL and DataFrames APIs to process structured data in the Spark cluster.
- Revised procedures to reduce redundancy and improved system performance by tuning the code, using explain plans, correct joins, and indexing.
- Implemented Spark jobs in Python and used the DataFrames and Spark SQL APIs for faster data processing.
- Created global temporary tables to improve query performance by eliminating repeated references to the main tables in the procedures, yielding runtime improvements.
- Ingested image responses through a Kafka producer written in Python (a producer sketch appears at the end of this list).
- Developed SQL queries using stored procedures, common table expressions (CTEs), and temporary tables to support SSRS and Power BI reports.
- Worked with the DevOps team to configure the NiFi pipeline on EC2 nodes, integrated with Spark, Kafka, and Postgres running on other instances, using SSL handshakes.
- Worked with Continuous Integration/Continuous Delivery (CI/CD) using Jenkins for timely builds and test runs.
- Worked on a NiFi data pipeline to process large sets of data and configured lookups for data validation and integrity.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Defined Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT) processes for the data lake.
- Enforced YARN resource pools to share cluster resources among YARN jobs submitted by users.
- Explored Spark for improving the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Used Spark to export transformed streaming datasets into Redshift on AWS cloud.
- Created Lambda functions to move data from S3 into Spark for structured data processing.
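A minimal sketch of a Python Kafka producer of the kind referenced above, assuming the kafka-python client library; the broker, topic, and payload shape are hypothetical placeholders.

```python
# Minimal Python Kafka producer sketch for image-response events.
# Assumes kafka-python (pip install kafka-python); broker, topic, and payload
# shape are illustrative placeholders.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_image_response(image_id, status):
    """Send one image-response event to the placeholder topic."""
    event = {"image_id": image_id, "status": status}
    producer.send("image-responses", value=event)              # placeholder topic

publish_image_response("img-0001", "processed")
producer.flush()
producer.close()
```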
Environment: PLSQL, ETL Packages, RDBMS, Hadoop, Spark, Kafka, HBase, Python, Scala, SQL, EC2, SSIS Packages, Oracle, EC2 Nodes, NiFi, MySQL, EOCWissen
Confidential
Software developer
Responsibilities:
- Implemented solutions for ingesting data from various sources and processing the Data-at-Rest utilizing Big Data technologies such as Hadoop, MapReduce Frameworks, HBase, Hive.
- Installed and configured a Kafka producer to ingest data into the Kafka broker.
- Installed and configured a Spark consumer to stream data from the Kafka producer, and used Spark to migrate the data to HBase.
- Used Spark SQL and the DataFrames API to process structured and semi-structured data in Spark clusters.
- Extensively worked on the CI/CD pipeline for code deployment, engaging different tools (Git, Jenkins, CodePipeline).
- Familiar with Hive joins; used HQL for querying the databases, eventually leading to complex Hive UDFs.
- Installed the OS and administered the Hadoop stack with the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging, and performance tuning.
- Implemented end-to-end data pipeline using FTP Adaptor, Spark, Hive, and Impala.
- Implemented Spark using Scala and utilized Spark SQL heavily for faster development and data processing.
- Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala (a PySpark illustration of this pattern appears at the end of this list).
- Extensive experience building and publishing customized interactive reports and dashboards, and scheduling reports, using Tableau Desktop and Tableau Server.
- Involved in Installation and upgrade of Tableau server and server performance tuning for optimization.
- Used Informatica PowerCenter for ETL (extraction, transformation, and loading) of data from heterogeneous source systems into the target database.
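A minimal PySpark illustration of the Hive-to-Spark-SQL conversion pattern referenced above (the original work used Scala); table and column names are hypothetical placeholders.

```python
# Minimal PySpark sketch: a Hive-style query run through Spark SQL, alongside
# the equivalent DataFrame transformation. Table and column names are
# illustrative placeholders; the original work used Scala.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark")
         .enableHiveSupport()
         .getOrCreate())

# Hive-style query executed via Spark SQL against the Hive metastore
sql_result = spark.sql("""
    SELECT region, COUNT(*) AS order_count
    FROM sales.orders
    WHERE order_date >= '2020-01-01'
    GROUP BY region
""")

# Equivalent DataFrame transformation
df_result = (spark.table("sales.orders")
             .filter(F.col("order_date") >= "2020-01-01")
             .groupBy("region")
             .agg(F.count(F.lit(1)).alias("order_count")))

sql_result.show()
df_result.show()
```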
Environment: Hadoop, Hive, HBase, Spark, SQL, Tableau, Jenkins, Kafka, CDH5, Spark sql, Git, MapReduce, HQL, Linux, ETL, Cloudera.