Sr. Data Engineer Resume
Charlotte, NC
SUMMARY
- Over 8 years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
- Cross-functional technical professional with progressive experience in Big Data systems administration, Linux systems, security, and technical support within large-scale enterprise IT portfolios and service projects.
- Data Engineering professional with solid foundational skills and a proven track record of implementation across a variety of data platforms. Self-motivated with a strong adherence to personal accountability in both individual and team scenarios.
- Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
- Strong experience in writing scripts using the Python API, PySpark API, and Spark API for analyzing data.
- Extensively used Python libraries including PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, Embedly, NumPy, and Beautiful Soup.
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.
- Skilled in Tableau Desktop 10.x for data visualization, reporting, and analysis.
- Developed reports and dashboards using Tableau for quick reviews presented to business and IT users.
- Extensive knowledge of reporting objects such as facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters in Tableau.
- Experience working with Flume and NiFi for loading log files into Hadoop.
- Experience in working with NoSQL databases like HBase and Cassandra.
- Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
- Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
- Worked with Cloudera and Hortonworks distributions.
- Expertise working with AWS cloud services like EMR, S3, Redshift, and CloudWatch for big data development.
- Good working knowledge of the Amazon Web Services (AWS) Cloud Platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Expertise in designing complex mappings, with additional expertise in performance tuning and Slowly Changing Dimension and Fact tables.
- Extensively worked with Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Experienced in building automated regression scripts for validation of ETL processes between multiple databases like Oracle, SQL Server, Hive, and MongoDB using Python.
- Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
- Expert in building Enterprise Data Warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.
- Experience in designing star schema and snowflake schema for Data Warehouse and ODS architectures.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
- Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
- Good knowledge of Data Marts, OLAP, Dimensional Data Modeling with Ralph Kimball Methodology (Star Schema Modeling, Snow-Flake Modeling for FACT and Dimensions Tables) using Analysis Services.
- Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
- Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
- Developed custom Kafka producers and consumers for publishing to and subscribing from different Kafka topics.
- Good working experience with Spark (Spark Streaming, Spark SQL) with Scala and Kafka; worked on reading multiple data formats on HDFS using Scala.
- Worked on Spark SQL, created DataFrames by loading data from Hive tables, and created prepped data stored in AWS S3 (a minimal sketch appears at the end of this summary).
- Hands-on experience using other Amazon Web Services like Auto Scaling, Redshift, DynamoDB, and Route 53.
- Experience with operating systems: Linux (Red Hat) and UNIX.
- Experience in complete project life cycle (design, development, testing and implementation) of Client Server and Web applications.
- Excellent programming skills with experience in Java, C, SQL and Python Programming.
- Worked across various programming languages using IDEs like Eclipse, NetBeans, and IntelliJ, along with tools such as PuTTY and Git.
- Experienced in working in SDLC, Agile and Waterfall Methodologies.
- Excellent experience in designing and developing Enterprise Applications for the J2EE platform using Servlets, JSP, Struts, Spring, Hibernate, and web services.
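The Spark SQL bullet above references the following sketch: a minimal PySpark example, assuming a hypothetical Hive table `default.trades` and S3 path `s3a://my-bucket/prep/trades/`, of loading a Hive table into a DataFrame, preparing it, and writing it to S3 as Parquet.

```python
# Minimal PySpark sketch: load a Hive table into a DataFrame, prep it, and write to S3.
# The table name, columns, and bucket path are illustrative placeholders, not project values.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-s3-prep")
    .enableHiveSupport()  # needed to read managed Hive tables
    .getOrCreate()
)

# Read from Hive via Spark SQL
trades = spark.sql("SELECT symbol, trade_ts, price, quantity FROM default.trades")

# Simple prep: drop incomplete rows and add a derived column
prepped = (
    trades.dropna(subset=["symbol", "price"])
          .withColumn("notional", F.col("price") * F.col("quantity"))
)

# Persist the prepped data to S3 as Parquet, partitioned by symbol
prepped.write.mode("overwrite").partitionBy("symbol").parquet("s3a://my-bucket/prep/trades/")
```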
TECHNICAL SKILLS
Big Data Tools: Hadoop Ecosystem, MapReduce, Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure, Google Cloud.
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
Databases: Oracle 12c/11g, Teradata R15/R14.
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, Charlotte, NC
Sr. Data Engineer
Responsibilities:
- Collaborated with Business Analysts and SMEs across departments to gather business requirements and identify workable items for further development.
- Partnered with ETL developers to ensure that data was well cleansed and the data warehouse was kept up to date for reporting purposes, using Pig.
- Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and loaded the data into AWS Redshift.
- Performed simple statistical profiling, such as cancel rate, variance, skewness, and kurtosis of trades, and runs for each stock daily, grouped by 1-, 5-, and 15-minute intervals.
- Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded them into a data warehouse (see the sketch following this role's Environment line).
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, and pair RDDs.
- Involved in integration of the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
- Boosted the performance of regression models by applying polynomial transformation and feature selection and used those methods to select stocks.
- Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
- Utilized Agile and Scrum methodology for team and project management.
- Used Git for version control with colleagues.
- Expertise in using Docker to run and deploy applications in multiple containers using Docker Swarm and Docker Wave.
- Developed stored procedures/views in Snowflake and used them in Talend for loading Dimensions and Facts.
- Developed merge scripts to UPSERT data into Snowflake from an ETL source.
Environment: HDFS, Hive, Spark (PySpark, Spark SQL, Spark MLlib), Kafka, Linux, Python 3.x (scikit-learn, NumPy, Pandas), Tableau 10.1, GitHub, AWS EMR/EC2/S3/Redshift, Pig, JSON and Parquet file formats, MapReduce, Spring Boot, Cassandra, Swamp, Data Lake, Sqoop, Oozie
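The moving-average/RSI bullet above references the following sketch: a minimal Pandas example of a simple moving average and a 14-period RSI. The price series is a toy example, and this RSI variant uses a plain rolling mean of gains and losses rather than Wilder's smoothing.

```python
# Minimal Pandas sketch of a simple moving average and a 14-period RSI.
# The price series is illustrative; the RSI uses a simple rolling mean of gains/losses.
import pandas as pd

def moving_average(close: pd.Series, window: int = 20) -> pd.Series:
    """Simple moving average over the given window of closing prices."""
    return close.rolling(window=window).mean()

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index from average gains/losses over `period` bars."""
    delta = close.diff()
    gains = delta.clip(lower=0).rolling(window=period).mean()
    losses = (-delta.clip(upper=0)).rolling(window=period).mean()
    rs = gains / losses
    return 100 - (100 / (1 + rs))

# Toy usage
prices = pd.Series([44.3, 44.1, 44.5, 43.9, 44.8, 45.1, 45.4, 45.2, 46.0, 45.6,
                    46.2, 46.5, 46.1, 46.8, 47.0, 46.7])
df = pd.DataFrame({"close": prices})
df["sma_5"] = moving_average(df["close"], window=5)
df["rsi_14"] = rsi(df["close"], period=14)
print(df.tail())
```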
Confidential, PA
Big Data Engineer/Data Analyst
Responsibilities:
- Gathered data and business requirements from end users and management. Designed and built data solutions to migrate existing source data from Teradata and DB2 to BigQuery (Google Cloud Platform).
- Performed data manipulation on extracted data using Python Pandas.
- Worked with subject matter experts and the project team to identify, define, collate, document, and communicate the data migration requirements.
- Built custom Tableau dashboards for Salesforce that accept parameters from Salesforce to show the relevant data for the selected object.
- Hands-on Ab Initio ETL, data mapping, transformation, and loading in complex, high-volume environments.
- Designed Sqoop scripts to load data from Teradata and DB2 into the Hadoop environment, and shell scripts to transfer data from Hadoop to Google Cloud Storage (GCS) and from GCS to BigQuery.
- Validated Sqoop jobs and shell scripts and performed data validation to check that data was loaded correctly without discrepancies. Performed migration and testing of static data and transaction data from one core system to another.
- Developed best practices, processes, and standards for effectively carrying out data migration activities. Worked across multiple functional projects to understand data usage and implications for data migration.
- Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data to and from different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Worked with data governance and data quality teams to design various models and processes.
- Involved in all steps and the scope of the project's reference data approach to MDM; created a Data Dictionary and mapping from sources to the target in the MDM Data Model.
- Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate with other Azure services; knowledge of U-SQL.
- Prepared data migration plans including migration risk, milestones, quality, and business sign-off details.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and AWS Lambda functions in Java to perform event-driven processing; created Lambda jobs and configured roles using the AWS CLI (a minimal handler sketch follows this role's Environment line).
- Oversaw the migration process from a business perspective, coordinated between leads, the process manager, and the project manager, and performed business validation of uploaded data.
- Worked on retrieving data from the file system to S3 using Spark commands.
- Built S3 buckets, managed policies for S3 buckets, and used S3 and Glacier for storage and backup on AWS.
- Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
- Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
- Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Closely involved in scheduling Daily, Monthly jobs with Precondition/Postcondition based on the requirement.
- Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
Environment: Hadoop, MapReduce, AWS Lambda, Azure, ADF, Snowflake, HDFS, Hive, MySQL, SQL Server, Tableau, Spark, SSIS, Sqoop jobs, shell scripts, Ab Initio ETL, data mapping
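The AWS Lambda bullet above references the following sketch: a minimal Python handler for event-driven processing of S3 object-created events. Bucket names and keys come from the event; everything else here (function structure, logging approach) is an illustrative assumption, not the actual project configuration.

```python
# Minimal AWS Lambda handler sketch for S3 object-created events (illustrative only).
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by an S3 put event; reads metadata of the new object and logs it."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        head = s3.head_object(Bucket=bucket, Key=key)
        print(json.dumps({
            "bucket": bucket,
            "key": key,
            "size_bytes": head["ContentLength"],
        }))
    return {"statusCode": 200, "body": "processed"}
```

Deployment and role assignment for such a handler can be scripted with the AWS CLI (for example, `aws lambda create-function` and `aws lambda add-permission`), in line with the CLI-based configuration described above.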
Confidential, IL
Data Engineer
Responsibilities:
- Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
- Strong understanding of AWS components such as EC2 and S3
- Implemented a Continuous Delivery pipeline with Docker and GitHub
- Worked with Google Cloud Functions in Python to load data into BigQuery upon arrival of CSV files in a GCS bucket.
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
- Experience working across the AWS/Azure/GCP stack (Datastore, BigQuery, Bigtable, Google Cloud Storage, AWS Glue, S3, Kinesis Data Analytics, EMR, Redshift, ADLS, HDInsight, Elasticsearch, Dataflow).
- Devised simple and complex SQL scripts to check and validate Data Flow in various applications.
- Used Hive to implement data warehouse and stored data into HDFS. Stored data into Hadoop Clusters which are set up in AWS EMR.
- Performed Data Preparation by using Pig Latin to get the right data format needed.
- Used Python Pandas, NiFi, Jenkins, NLTK, and TextBlob to complete the ETL process on clinical data for future NLP analysis.
- Utilized the clinical data to generate features to describe the different illnesses by using LDA Topic Modeling.
- Used PCA to reduce dimensionality and compute eigenvalues and eigenvectors, and used OpenCV to analyze CT scan images to identify disease in the scans.
- Processed the image data through the Hadoop distributed system by using Map and Reduce then stored into HDFS.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend
- Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
- Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
- Used Git for version control with the Data Engineer team and Data Scientists colleagues.
- Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through Python.
- Developed and deployed data pipelines in clouds such as AWS and GCP.
- Performed data engineering functions: data extract, transformation, loading, and integration in support of enterprise data infrastructures - data warehouse, operational data stores and master data management
- Big Data: Hadoop (MapReduce & Hive), Spark (SQL, Streaming), Azure Cosmos DB, SQL Data Warehouse, Azure DMS, Azure Data Factory, AWS Redshift, Athena, Lambda, Step Functions, and SQL.
- Responsible for data services and data movement infrastructures; good experience with ETL concepts, building ETL solutions, and data modeling.
- Architected several DAGs (Directed Acyclic Graphs) for automating ETL pipelines (a minimal DAG sketch follows this role's Environment line).
- Hands on experience on architecting the ETL transformation layers and writing spark jobs to do the processing.
- Hands-on experience in installation, configuration, management, and development of big data solutions using Hortonworks distributions.
- Gathered and processed raw data at scale, including writing scripts, web scraping, calling APIs, writing SQL queries, and writing applications.
- Experience in fact and dimensional modeling (star schema, snowflake schema), transactional modeling, and SCDs (Slowly Changing Dimensions).
- Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
- Developed logistic regression models (Python) to predict subscription response rate based on customer’s variables like past transactions, response to prior mailings, promotions, demographics, interests, and hobbies, etc.
- Developed near real-time data pipelines using Spark.
- Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, and Stackdriver.
- Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
- Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
- Worked with Confluence and Jira; skilled in data visualization libraries such as Matplotlib and Seaborn.
- Hands on experience with big data tools like Hadoop, Spark, Hive
- Experience implementing machine learning back-end pipelines with Pandas and NumPy.
Environment: GCP, BigQuery, GCS buckets, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Docker, Kubernetes, AWS, Apache Airflow, Python, Pandas, Matplotlib, Seaborn, text mining, NumPy, scikit-learn, heat maps, bar charts, line charts, ETL workflows, linear regression, multivariate regression, Scala, Spark
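The DAG bullet above references the following sketch: a minimal Apache Airflow DAG using Airflow 1.10.x import paths (matching the version listed under Technical Skills). The task names and callables are illustrative placeholders, not the actual project pipeline.

```python
# Minimal Airflow DAG sketch (1.10.x-style imports) for a daily extract-transform-load run.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract(**context):
    print("extract: pull raw files from the landing zone")

def transform(**context):
    print("transform: clean and conform the extracted data")

def load(**context):
    print("load: write conformed data to the warehouse")

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_etl_pipeline",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract, provide_context=True)
    t_transform = PythonOperator(task_id="transform", python_callable=transform, provide_context=True)
    t_load = PythonOperator(task_id="load", python_callable=load, provide_context=True)

    # Linear dependency: extract -> transform -> load
    t_extract >> t_transform >> t_load
```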
Confidential
Sr. Java/J2EE Developer
Responsibilities:
- Involved in design and development using Java, J2EE, Web 2.0 technologies, and Liferay Portal Server 5.1 in a Liferay Portal environment.
- Created the UI using JSP, Struts, JavaScript, CSS, and HTML.
- Designed and Implemented MVC architecture using Struts Framework, which involved writing Action Classes/Forms/Custom Tag Libraries & JSP pages.
- Used AngularJS controllers (JavaScript object) to control page data and Models to bind data in user interface with controller and used Custom AngularJS Filters to provide better search experience.
- Wrote application-level code to perform client-side validation using jQuery and JavaScript.
- Designed the front-end applications, user interactive (UI) web pages using web technologies like HTML, XHTML, CSS, Bootstrap, and other frameworks.
- Designed, developed, and maintained the data layer using Hibernate and performed configuration of Spring Application Framework.
- Integration of mainframe applications using JMS/WebSphere/SQLA.
- Designed and coded business components in Java as Enterprise JavaBeans (EJB) and exposed them as web services.
- Deployed JAR and WAR files, session beans, etc. on WebSphere Application Server.
- Used web services to extract client-related data from databases using WSDL, XML, and SOAP.
- Worked with QA team to design test plan and test cases for User Acceptance Testing (UAT).
- Used Apache Ant to compile Java classes and package them into JAR/WAR archives; involved in low-level and high-level documentation of the product.
Environment: Java 1.6, J2EE 5, Spring 2.1, Spring AOP, Hibernate 3.x, Struts Framework, XML, EJB 3.0, JMS, MDB, JSP, JSF, JSF RichFaces, Swing, SVN, PVCS, RAD 7.0, CSS, Ajax, DOJO, WebSphere 6.0, JUnit.