Sr. Data Engineer Resume
Irving, TX
SUMMARY
- 8+ years of IT experience as a Python/Data Engineer, including deep expertise in statistical data analysis: transforming business requirements into analytical models, designing algorithms, and building strategic solutions that scale across massive volumes of data.
- Expert in R and Python scripting; worked with statistical functions in NumPy, visualization using Matplotlib/Seaborn, and Pandas for organizing data.
- Experience in Scala and Spark, and in using various R and Python packages such as ggplot2, dplyr, plyr, SciPy, scikit-learn, Beautiful Soup, and Rpy2.
- Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards using tools like Tableau.
- Experience in writing code in R and Python to manipulate data for data loads, extracts, statistical analysis, modelling, and data munging.
- Created clusters in Google Cloud and managed them using Kubernetes (k8s); used Jenkins to deploy code to Google Cloud, create new namespaces, build Docker images, and push them to the Google Cloud container registry.
- Experience with the Apache Hadoop ecosystem, with good knowledge of the Hadoop Distributed File System (HDFS), MapReduce, Hive, Pig, Python, HBase, Sqoop, Kafka, Flume, Cassandra, Oozie, Impala, and Spark.
- Experience as a Web/Application Developer with analytical programming using Python, Django, Java, and various JavaScript frameworks (AngularJS, TypeScript, NPM, React JS, Redux, D3.js, Vue.js, jQuery, and Ext JS).
- Expertise in creating Scrum stories and sprints in an Agile, Python-based environment, along with data analytics, Excel data extracts, and data wrangling.
- Developed mappings/sessions using Informatica Power Center 8.6 for data loading.
- Experience in all phases of the Software Development Life Cycle (SDLC), including requirements specification, design documents, integration, documentation, and writing test cases, using software engineering processes ranging from Waterfall to Agile.
- Created shell scripts to fine tune the ETL flow of the Informatica workflows.
- Experience using Python machine-learning libraries such as pandas, NumPy, Matplotlib, scikit-learn, and SciPy to load, summarize, and visualize datasets, evaluate algorithms, and make predictions (a minimal sketch follows this summary).
- Expertise in the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Experience in developing RESTful APIs using the Django REST Framework.
- Experience implementing MVC architecture using Servlets and Django, along with RESTful and SOAP web services and SoapUI.
- Hands-on experience in developing web applications and RESTful web services and APIs using Python Flask, Django and PHP.
- Highly skilled in using visualization tools like Tableau, ggplot2 and d3.js for creating dashboards.
- Hands on experience with big data tools like Hadoop, Spark, Hive, Pig, Impala, Pyspark, Spark SQL.
- Good knowledge in Database Creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase and SQL Server databases.
- Used Informatica Power Center for extraction, transformation, and loading (ETL) of data from heterogeneous source systems into target databases.
- Experience with Google Cloud components, Google container builders, GCP client libraries, and the Cloud SDK.
- Interpret problems and provide solutions using data analysis, data mining, optimization tools, machine learning techniques, and statistics.
- Knowledge of proofs of concept (PoCs) and gap analysis; gathered data for analysis from various sources and prepared it for exploration using data munging and Teradata.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
- Worked with various RDBMS such as Oracle, MySQL, SQL Server, DB2, and Teradata, with expertise in creating tables, data population, and data extraction from these databases.
- Expertise in SQL queries to extract data from tables, along with creation of tables, subqueries, joins, views, indexes, SQL functions, set operators, and other functionality.
- Strong experience in implementing data warehouse solutions in Amazon Redshift, Oracle, and SQL Server.
- Experience in extracting, transforming, and loading (ETL) data from spreadsheets, database tables, flat files, and other sources using Talend Open Studio and Informatica.
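A minimal sketch of the load/summarize/visualize/evaluate workflow referenced in the machine-learning bullet above; the iris sample dataset and logistic regression model are illustrative placeholders, not drawn from a specific project:

```python
# Illustrative only: load, summarize, visualize, and evaluate a dataset with
# pandas, Matplotlib, and scikit-learn. The dataset and model are placeholders.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset into a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame

# Summarize the dataset.
print(df.describe())

# Visualize one relationship in the data.
df.plot.scatter(x="sepal length (cm)", y="petal length (cm)", c="target", colormap="viridis")
plt.savefig("iris_scatter.png")

# Evaluate a simple algorithm and make predictions.
X_train, X_test, y_train, y_test = train_test_split(
    df[iris.feature_names], df["target"], test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```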
TECHNICAL SKILLS
Programming Languages: Python, SQL, Java, R Programming and C
Web Technologies: HTML/HTML5, CSS/CSS3, XML, jQuery, JSON, Bootstrap, AngularJS
Python Libraries/Packages: NumPy, SciPy, Pickle, PySide, PyTables, DataFrames, Pandas, Matplotlib, SQLAlchemy, httplib2, urllib2, Beautiful Soup, PyQuery
IDE: Jupyter Notebook, PyCharm, PyScripter, Spyder, PyStudio, PyDev, IDLE, NetBeans, Sublime Text, Visual Studio Code
Cloud Computing: AWS, OpenStack, GCP
AWS Services: EMR, EC2, S3, Lambda, API Gateway, Athena, Kinesis, SQS, SNS, CloudWatch
Databases/Servers: MySQL, SQLite3, Cassandra, Redis, PostgreSQL, CouchDB, MongoDB, Teradata, Apache Web Server 2.0, NginX, Tomcat, JBoss, WebLogic
ETL: Informatica, DataStage, SSIS
Web Services/Protocols: TCP/IP, UDP, FTP, HTTP/HTTPS, SOAP, REST, RESTful
Build and CI tools: Docker, Kubernetes, Maven, Gradle, Jenkins, Hudson, Bamboo
SDLC/Testing Methodologies: Agile, Waterfall, Scrum, TDD
PROFESSIONAL EXPERIENCE
Confidential, Irving TX
Sr. Data Engineer
Responsibilities:
- Migrated an entire Oracle database to BigQuery and used Power BI for reporting.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Experience in moving data between GCP and Azure using Azure Data Factory (ADF).
- Experience building Power BI reports on Azure Analysis Services for better performance.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Coordinated with the team and developed a framework to generate daily ad-hoc reports and extracts from enterprise data in BigQuery.
- Coordinated with the data science team in designing and implementing advanced analytical models on a Hadoop cluster over large datasets.
- Wrote Hive SQL scripts for creating complex tables with performance features such as partitioning, clustering, and skewing.
- Worked on downloading BigQuery data into pandas and Spark data frames (a minimal sketch appears after this list).
- Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.
- Created a PoC for utilizing ML models and Cloud ML for table analysis in the batch process.
- Knowledge of Cloud Dataflow and Apache Beam.
- Carried out data transformation and cleansing using SQL queries, Python, and PySpark.
- Good knowledge of using Cloud Shell for various tasks and deploying services.
- Created BigQuery authorized views for row-level security and for exposing data to other teams.
- Expertise in designing and deploying Hadoop clusters and various big data analytics tools, including Pig, Hive, and Apache Spark, with the Cloudera distribution.
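A minimal sketch of the BigQuery-to-pandas step mentioned above, using the google-cloud-bigquery client; the project, dataset, and table names are placeholders:

```python
# Pull a BigQuery query result into a pandas DataFrame for downstream analysis.
# Project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

sql = """
    SELECT order_date, SUM(order_amount) AS daily_revenue
    FROM `my-gcp-project.sales.orders`
    GROUP BY order_date
    ORDER BY order_date
"""

# to_dataframe() requires the pandas/pyarrow extras of google-cloud-bigquery.
df = client.query(sql).to_dataframe()
print(df.head())
```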
Environment: Big Data/Hadoop, Oracle 12c, SQL, PL/SQL, API, HBase, NoSQL, Python, PySpark, ADF, SaaS, Erwin, Kafka, Spark, SSIS, MapReduce, ETL, Pub/Sub, SSRS, Tableau, Oozie, Teradata, Dataflow, Cloud Functions, BigQuery, SDK, Apache Beam, Dataproc.
Confidential, Nashville, TN
Sr Data Engineer
Responsibilities:
- Extensively handled big data using Hadoop ecosystem components like Sqoop, Pig, and Hive for data pipeline design.
- Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL.
- Experience in building and architecting multiple data pipelines, with end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
- Used Snowflake functions to parse semi-structured data entirely with SQL statements.
- Performed a key role in understanding the business requirements for migrating data to data warehouse.
- Helped individual teams set up their repositories in Bitbucket and maintain their code, and helped them set up jobs that use the CI/CD environment.
- Used REST APIs with Python to ingest data from external sources into BigQuery.
- Built a program with Python and Apache Beam, executed in Cloud Dataflow, to run data validation between raw source files and BigQuery tables (a minimal sketch appears after this list).
- Built a configurable Scala and Spark based framework to connect to common data sources like MySQL, Oracle, Postgres, SQL Server, Salesforce, and BigQuery and load the data into BigQuery.
- Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Built a deployment (CI/CD) pipeline for a fast-paced, robust application development environment.
- Created job flows using Airflow in Python and automated the jobs; Airflow has a separate stack for deploying DAGs and runs jobs on an EMR or EC2 cluster.
- Created logging for ETL loads at the package level and task level to log the number of records processed by each package and each task in a package using SSIS.
- Used DataStage as an ETL tool to extract data from source systems and load it into the Oracle database.
- Compared the performance of the Hadoop-based system to the existing processes used for preparing the data for analysis.
- Worked on real time data integration using Kafka, Spark streaming and HBase.
- Performed unit testing at various levels of the ETL and was actively involved in team code reviews.
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
- Designed and implemented effective Analytics solutions and models with Snowflake.
- Designed, developed, tested, and maintained Tableau functional reports based on user requirements.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Redshift.
- Performed Data integrity, validation and testing on the data migrated into the data warehouse.
- Assist ETL developers with specifications, documentation, and development of data migration mappings and transformations for Data Warehouse loading.
- Designed SSIS packages to transfer data from flat files and Excel to SQL Server using Business Intelligence Development Studio.
- Migrated HiveQL queries into SparkSQL to improve performance.
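A minimal sketch of the Beam/Dataflow validation idea referenced above, comparing row counts between a raw GCS file and a BigQuery table; the project, bucket, and table names are placeholders:

```python
# Compare the row count of a raw source file in GCS with the row count of the
# corresponding BigQuery table. All resource names below are placeholders.
import logging
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(
        runner="DataflowRunner",            # use "DirectRunner" for local testing
        project="my-gcp-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        raw_count = (
            p
            | "ReadRawFile" >> beam.io.ReadFromText("gs://my-bucket/raw/orders.csv", skip_header_lines=1)
            | "CountRaw" >> beam.combiners.Count.Globally()
        )
        bq_count = (
            p
            | "ReadBigQuery" >> beam.io.ReadFromBigQuery(table="my-gcp-project:staging.orders")
            | "CountBQ" >> beam.combiners.Count.Globally()
        )
        (
            (raw_count, bq_count)
            | "MergeCounts" >> beam.Flatten()
            | "ToList" >> beam.combiners.ToList()
            | "Compare" >> beam.Map(
                lambda counts: logging.info(
                    "row counts %s match" if len(set(counts)) == 1 else "row counts %s DIFFER",
                    counts,
                )
            )
        )

if __name__ == "__main__":
    run()
```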
Environment: Big Data/Hadoop, Oracle 12c, Sqoop, Pig, Hive, SQL, PL/SQL, API, HBase, NoSQL, Python, PySpark, ADF, SaaS, Erwin, Kafka, Spark, SSIS, MapReduce, ETL, SSRS, Tableau, Oozie, Teradata.
Confidential, Austin, TX
Python/Data Engineer
Responsibilities:
- Developed a data platform from scratch and took part in the requirement gathering and analysis phase of the project, documenting the business requirements.
- Designed tables in Hive and MySQL, used Sqoop for importing and exporting databases to HDFS, and processed large datasets of different forms, including structured, semi-structured, and unstructured data.
- Created Airflow Scheduling scripts in Python.
- Built import and export data jobs to copy data to and from HDFS using Sqoop, and developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
- Analyzed SQL scripts and designed solutions to implement them using PySpark.
- Used JSON and XML for serialization and de-serialization to load JSON and XML data into Hive tables.
- Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, handling structured data with Spark SQL (a minimal sketch appears after this list).
- Integrated services like GitHub, AWS CodePipeline, Jenkins, and AWS Elastic Beanstalk to create a deployment pipeline.
- Developed data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into target data destinations.
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
- Expertise in implementing Spark using Scala and Spark SQL for faster testing and processing of data, responsible for managing data from different sources.
- To achieve continuous delivery goals in a highly scalable environment, used Docker coupled with the load-balancing tool Nginx.
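A minimal sketch of the Spark SQL JSON-to-Hive load described above, using the DataFrame API (the modern equivalent of the schema RDD); the input path and table name are placeholders, and Hive support is assumed to be configured on the cluster:

```python
from pyspark.sql import SparkSession

# Hive support is assumed to be available on the cluster; names are placeholders.
spark = (
    SparkSession.builder
    .appName("json-to-hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Spark infers the schema from the JSON records.
events = spark.read.json("hdfs:///data/raw/events/*.json")
events.printSchema()

# Persist the structured result as a Hive table for downstream SQL access.
events.write.mode("overwrite").saveAsTable("analytics.events")

# Structured data can then be queried with Spark SQL.
spark.sql("SELECT COUNT(*) AS n FROM analytics.events").show()
```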
Environment: Hadoop, Hive, Sqoop, Pig, Java, Django, Flask, XML, MySQL, MS SQL Server, Linux, Shell Scripting, MongoDB, SQL, Python 3.3, HTML5/CSS, Cassandra, JavaScript, PyCharm, Git, RESTful, Docker, Jenkins, JIRA, jQuery, Bootstrap, AWS, EC2, S3.
Confidential, Bothell, WA
Python Developer
Responsibilities:
- Understood the business process variants and created the process flow for automating ad-hoc requests.
- Developed the MapReduce flows in the Microsoft HDInsight Hadoop environment using Python.
- Developed Hive UDFs and Pig UDFs using Python in the Microsoft HDInsight environment.
- Worked on development of SQL and stored procedures on MySQL.
- Involved in building database models, APIs, and views utilizing Python in order to build an interactive web-based solution.
- Coded in a Python (Linux, MySQL) environment.
- Developed DataStage design concepts, with execution, testing, and deployment on the client server.
- Used Python to extract weekly hotel availability information from CSV files.
- Developed Python batch processors to consume and produce various feeds.
- Used the pandas API to put the data into time series and tabular formats for easy timestamp-based data manipulation and retrieval (a minimal sketch appears after this list).
- Involved in designing and developing Amazon EC2, Amazon S3, Amazon RDS, Amazon Elastic Load Balancing, Amazon SWF, Amazon SQS, and other services of the AWS infrastructure.
- Managed large datasets using pandas data frames and MySQL.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/text files) into AWS Redshift.
- Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
- Designed and used shell scripts that automate the DataStage jobs and validate files.
- Participated in requirement gathering and worked closely with the architect in designing and modeling.
- Generated graphical reports using the Python packages NumPy and Matplotlib.
- Represented the system in hierarchical form by defining components and subcomponents using Python, and developed a set of library functions over the system based on user needs.
- Developed Python APIs to dump the array structures in the processor at the failure point for debugging.
- Extracted the actual data in HTML format, processed the raw data, and interpreted and stored the results in well-organized JSON files.
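A minimal sketch of the pandas time-series handling mentioned above; the CSV file name and column names are illustrative placeholders:

```python
import pandas as pd

# Hypothetical weekly availability feed; file and column names are placeholders.
availability = pd.read_csv(
    "weekly_hotel_availability.csv",
    parse_dates=["check_in_date"],
)

# Index by timestamp so the data can be treated as a time series.
ts = availability.set_index("check_in_date").sort_index()

# Resample to weekly totals for quick tabular reporting.
weekly_rooms = ts["rooms_available"].resample("W").sum()
print(weekly_rooms.head())
```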
Environment: Python 2.7, Django 1.6, Tableau 8.2, Beautiful Soup, HTML5, CSS/CSS3, Bootstrap, XML, JSON, JavaScript, jQuery, AngularJS, BackboneJS, RESTful web services, Apache, Linux, Git, Jenkins.
Confidential
Programmer Analyst
Responsibilities:
- Built database models, views, and APIs using Python for interactive web-based solutions.
- Placed data into JSON files using Python to test Django websites.
- Used Python scripts to update the content in the database and manipulate files.
- Developed a web-based application using the Django framework with Python concepts.
- Generated Python Django forms to maintain records of online users.
- Extensively used SQL, NumPy, pandas, scikit-learn, Spark, and Hive for data analysis and model building.
- Involved in Python OOP code for quality, logging, monitoring, debugging, and code optimization.
- Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, handling structured data with Spark SQL.
- Explored different implementations in the Hadoop environment for data extraction and summarization using packages like Hive and Pig.
- Extracted, transformed, and loaded data sources to generate CSV data files with Python programming and SQL queries.
- Developed efficient AngularJS code for the client web-based application.
- Implemented a CI/CD (Continuous Integration and Continuous Delivery) pipeline for code deployment.
- Responsible for designing, developing, testing, deploying and maintaining the web application.
- Designed and developed the UI for the website with HTML, XHTML, CSS, JavaScript, and AJAX.
- Designed, developed, and deployed engaging web applications using Python.
- Wrote Python code using JSON and XML to produce HTTP GET requests and parse HTML data from websites (a minimal sketch appears after this list).
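A minimal sketch of the HTTP GET and HTML-parsing work described in the last bullet, using requests and Beautiful Soup; the URL, CSS selector, and output file are placeholders:

```python
import json
import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real sites are not named in this resume.
URL = "https://example.com/listings"

response = requests.get(URL, timeout=10)
response.raise_for_status()

# Parse the HTML payload and pull out the listing titles.
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.select("h2.listing-title")]

# Store the scraped results as a well-organized JSON file.
with open("listings.json", "w") as fh:
    json.dump({"titles": titles}, fh, indent=2)
```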
Environment: Python, HTML, CSS, Bootstrap, JavaScript, MongoDB, Linux, APIs, GIT.
Confidential
System Analyst
Responsibilities:
- Designed and developed the UI of the website using HTML, XHTML, AJAX, CSS, and JavaScript.
- Involved in the complete Software Development Life Cycle (SDLC) including Requirement Analysis, Design, Implementation, Testing and Maintenance.
- Built REST APIs to easily add new analytics or issuers into the model.
- Automated different workflows that were previously initiated manually, using Python scripts and Unix shell scripting (a minimal sketch appears after this list).
- Used Hibernate for mapping data representation from the MVC model to the Oracle relational data model with a SQL-based schema.
- Used the pandas API to put the data into time series and tabular formats for data manipulation and retrieval.
- Helped with the migration from the old server to the Jira database (matching fields) with Python scripts for transferring and verifying the information.
- Implemented multithreading for parallel processing of requests using various features of Concurrent API.
- Worked on Oracle 11g databases and wrote SQL queries as well as stored procedures for the application.
- Assisted with production support activities using JIRA when necessary to help identify and resolve escalated production issues based on the SLA.
- Used Spark and Spark SQL for data integration and manipulation; worked on a PoC for creating a Docker image on Azure to run the model.
- Wrote documents in support of the SDLC phases. Documents include requirements and analysis reports, design documents, and technical documentation.
- Extensively used Python's multiple data science packages like Pandas, NumPy, matplotlib, Seaborn, SciPy, Scikit-learn and NLTK.
- Performed Exploratory Data Analysis, trying to find trends and clusters.
- Studied the existing environment and gathered requirements by interacting with clients on various aspects.
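A minimal sketch of wrapping a previously manual shell workflow in a Python script so it can be scheduled, as referenced above; the script path and argument are hypothetical:

```python
import subprocess
import sys
from datetime import date

# Hypothetical nightly export that used to be started by hand; the path is a placeholder.
EXPORT_SCRIPT = "/opt/jobs/export_orders.sh"

def run_export(run_date: str) -> None:
    # Reuse the existing shell script but drive it from Python so cron/CI can schedule it.
    result = subprocess.run([EXPORT_SCRIPT, run_date], capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
        raise SystemExit(f"Export failed for {run_date}")
    print(f"Export finished for {run_date}")

if __name__ == "__main__":
    run_export(date.today().isoformat())
```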
Environment: Java, JSP, Servlets, spring, HTML, CSS, AJAX, Hibernate, XML, Maven, Oracle, JavaScript, Eclipse.