Senior Data Engineer Resume
Madison, WI
SUMMARY:
- 10+ years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Data Engineer, Data Developer, and Data Modeler.
- Strong experience with the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing, in both Waterfall and Agile methodologies.
- Strong experience writing data analysis scripts using the Python, PySpark, and Spark APIs.
- Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, psycopg, embedPy, NumPy, and Beautiful Soup.
- Hands-on experience with Spark Core, Spark SQL, and Spark Streaming, including creating and handling DataFrames in Spark with Scala.
- Experience with NoSQL databases, including table row-key design and loading and retrieving data for real-time processing, with performance improvements driven by data access patterns.
- Extensive experience with Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.
- Experience building large-scale, highly available web applications, with working knowledge of web services and other integration patterns.
- Developed simple to complex MapReduce and streaming jobs using Java and Scala.
- Developed Hive scripts for end-user/analyst ad hoc analysis, and used EMR with Hive to handle lower-priority bulk ETL jobs.
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
- Experience developing MapReduce programs with Apache Hadoop to analyze big data according to requirements.
- Hands-on experience with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
- Experience integrating various data sources such as Oracle SE2, SQL Server, flat files, and unstructured files into a data warehouse.
- Able to use Sqoop to migrate data between RDBMSs, NoSQL databases, and HDFS.
- Experience in Extraction, Transformation, and Loading (ETL) of data from various sources into data warehouses, as well as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
- Experience with proofs of concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for exploration using data munging and Teradata.
- Experience developing custom UDFs in Python to extend Hive and Pig Latin functionality.
- Expertise with AWS cloud services such as EMR, S3, Redshift, and CloudWatch for big data development.
- Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation across multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (an illustrative PySpark sketch appears at the end of this summary).
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data Governance, Metadata Management, Master Data Management, and Configuration Management.
- Experienced in building automated regression scripts in Python to validate ETL processes across databases such as Oracle, SQL Server, Hive, and MongoDB.
- Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
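A minimal PySpark sketch of the multi-format extraction and Spark SQL aggregation pattern described above; the S3 paths, column names, and the usage_events/usage_summary names are illustrative assumptions, not actual project artifacts.

    from pyspark.sql import SparkSession

    # Read several file formats, conform them, and aggregate usage metrics
    # with Spark SQL. Paths and columns are placeholders for this sketch.
    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    csv_df = spark.read.option("header", "true").csv("s3a://bucket/raw/events_csv/")
    json_df = spark.read.json("s3a://bucket/raw/events_json/")
    parquet_df = spark.read.parquet("s3a://bucket/raw/events_parquet/")

    # Conform the three sources to a shared schema before combining them.
    events = (
        csv_df.select("customer_id", "event_type", "event_ts")
        .unionByName(json_df.select("customer_id", "event_type", "event_ts"))
        .unionByName(parquet_df.select("customer_id", "event_type", "event_ts"))
    )
    events.createOrReplaceTempView("usage_events")

    # Aggregate with Spark SQL to expose customer usage patterns.
    usage_summary = spark.sql("""
        SELECT customer_id,
               event_type,
               COUNT(*)      AS event_count,
               MAX(event_ts) AS last_seen
        FROM usage_events
        GROUP BY customer_id, event_type
    """)

    usage_summary.write.mode("overwrite").parquet("s3a://bucket/curated/usage_summary/")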
TECHNICAL SKILLS:
Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase
Programming Languages: Java, PL/SQL, SQL, Python, Scala, PySpark, C, C++
Cluster Management & Monitoring: CDH 4, CDH 5, Hortonworks Ambari 2.5
Databases: MySQL, SQL Server, Oracle, MS Access
NoSQL Databases: MongoDB, Cassandra, HBase
Workflow Management Tools: Oozie, Apache Airflow
Visualization & ETL Tools: Tableau, BananaUI, D3.js, Informatica, Talend
Cloud Technologies: Azure, AWS
IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ
Version Control Systems: Git, SVN
Operating Systems: Unix, Linux, Windows
PROFESSIONAL EXPERIENCE:
Senior Data Engineer
Confidential, Madison, WI.
Responsibilities:
- Developing Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, using Kafka integrated with Spark Streaming. Developed data analysis tools using SQL and Python.
- Authoring Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks (see the sketch after this list). Migrating data from on-premises systems to AWS storage buckets.
- Followed Agile methodology, including test-driven development and pair programming.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and built Java-based AWS Lambda functions for event-driven processing.
- Developed a Python script to transfer data and call REST APIs to extract data from on-premises systems to AWS S3. Implemented a microservices-based cloud architecture using Spring Boot.
- Worked on ingesting data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
- Created YAML files for each data source, including Glue table stack creation. Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
- Wrote UNIX shell scripts to automate jobs and scheduled them as cron jobs with crontab. Created a Lambda deployment function and configured it to receive events from S3 buckets.
- Converted existing AWS infrastructure to a serverless architecture (AWS Lambda, Kinesis), deployed via Terraform and AWS CloudFormation templates.
- Worked on Docker container snapshots, attaching to running containers, removing images, managing directory structures, and managing containers.
- Experienced in day-to-day DBA activities including schema management, user management (creating users, synonyms, privileges, roles, quotas, tables, indexes, sequences), space management (tablespaces, rollback segments), monitoring (alert log, memory, disk I/O, CPU, database connectivity), job scheduling, and UNIX shell scripting.
- Expertise in using Docker to run and deploy applications in multiple containers with Docker Swarm and Docker Wave.
- Developed complex Talend ETL jobs to migrate data from flat files to databases; pulled files from the mainframe into the Talend execution server using multiple FTP components.
- Developed Talend ESB services and deployed them on ESB servers across different instances.
- Architected and designed serverless application CI/CD using the AWS Serverless Application Model with Lambda.
- Worked with data engineers and data architects to define back-end requirements for data products (aggregations, materialized views, tables, and visualizations).
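A minimal sketch of the kind of PySpark cleansing/conforming UDF referenced above; the phone-normalization rule, column names, and S3 paths are hypothetical placeholders rather than the actual project code.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("cleansing-udfs").getOrCreate()

    @F.udf(returnType=StringType())
    def normalize_phone(raw):
        """Strip non-digits and keep the last 10 digits, or None if too short."""
        if raw is None:
            return None
        digits = "".join(ch for ch in raw if ch.isdigit())
        return digits[-10:] if len(digits) >= 10 else None

    df = spark.read.parquet("s3a://bucket/raw/customers/")      # placeholder path
    cleaned = (
        df.withColumn("phone_clean", normalize_phone(F.col("phone")))
          .withColumn("state", F.upper(F.trim(F.col("state"))))  # conform casing
          .dropDuplicates(["customer_id"])
    )
    cleaned.write.mode("overwrite").parquet("s3a://bucket/conformed/customers/")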
Environment: Hadoop, MapReduce, HDFS, Hive, Spring Boot, Cassandra, Swamp, Data Lake, Sqoop, Oozie, SQL, Kafka, Spark, Scala, Java, AWS, GitHub, Talend Big Data Integration, Solr, Impala.
Sr. Data Engineer
Confidential - Indianapolis, IN
Responsibilities:
- Analyzed company’s business practice and product performance to determine potential growth.
- Involved in creating database objects: tables, indexes, views, user-defined functions, and parameterized stored procedures.
- Wrote many stored procedures for cleaning, manipulating, and processing data between databases.
- Experienced in strategically implementing indexes (clustered, non-clustered, and covering) on data structures to achieve faster data retrieval.
- Extensively used SQL queries to translate data into valuable information for decision making.
- Implemented advanced SQL queries using DML and DDL statements to extract large quantities of data from multiple data points on SQL Server.
- Extracted data from different sources (CSV files, flat files, Excel files, and MS SQL) and loaded it into an intermediate staging database using the SSIS ETL tool (an illustrative Python parallel appears after this list).
- Created highly complex SSIS packages using various data transformations such as Conditional Split, Lookup, For Each Loop, and error handling.
- Defined best practices for Tableau report development and effectively used the data blending feature in Tableau.
- Created SSIS packages to implement error/failure handling with row redirects.
- Developed cubes using SQL Server Analysis Services (SSAS) and dimensions using the cube wizard, and generated named calculations and named queries.
- Experience designing and building dimensions and cubes with star and snowflake schemas using SQL Server Analysis Services (SSAS) for analysis purposes.
- Created partitions and designed aggregations in Cubes.
- Involved in cube partitioning, refresh strategy and planning, and dimensional data modeling in Analysis Services (SSAS).
- Designed and implemented data models and reports in Power BI to help clients analyze data to identify market trends, competition and customer behavior.
- Developed custom calculated measures using DAX in Power BI to compare the company's performance industry-wide.
- Created different visualizations (stacked bar chart, clustered bar chart, scatter chart, pie chart, donut chart, line and clustered column chart, map, slicer, time brush, etc.) in Power BI according to the requirements.
- Developed dashboards including various charts: bar charts, pie charts, scatter charts, bubble charts, and shape maps.
- Created page-level, report-level, and visual-level filters in Power BI according to the requirements.
- Involved in reporting KPIs and gauges for key parameters to keep track of close competitors in the industry.
- Provided in-depth market research reports using qualitative and quantitative methods.
- Provided SWOT analysis of financial data using Power BI charts.
- Recommended new services based on data-driven analysis.
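An illustrative Python parallel to the SSIS staging load described above (the actual implementation used SSIS packages); the connection string, file paths, and table names are placeholder assumptions.

    import pandas as pd
    from sqlalchemy import create_engine

    # Load CSV and Excel extracts into a SQL Server staging database,
    # mirroring the SSIS flow. All identifiers below are illustrative.
    engine = create_engine(
        "mssql+pyodbc://user:password@staging-server/StagingDB"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )

    sources = {
        "sales_csv": pd.read_csv("data/sales.csv"),
        "sales_xlsx": pd.read_excel("data/sales.xlsx", sheet_name="Sheet1"),
    }

    for name, frame in sources.items():
        # Basic conforming before the staging load: tidy headers, drop blank rows.
        frame.columns = [c.strip().lower() for c in frame.columns]
        frame = frame.dropna(how="all")
        frame.to_sql(f"stg_{name}", engine, if_exists="replace", index=False)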
Environment: MS SQL Server 2012/2014/2016, Microsoft Management Studio, Microsoft Visual Studio, Power BI, Power Pivot, Tableau, DAX.
Sr. Data Engineer
Confidential - Frisco, TX
Responsibilities:
- Transformed business problems into big data solutions, defined big data strategy and roadmap, and installed, configured, and maintained data pipelines.
- Developed features, scenarios, and step definitions for BDD (Behavior-Driven Development) and TDD (Test-Driven Development) using Cucumber, Gherkin, and Ruby.
- Designing the business requirement collection approach based on the project scope and SDLC methodology.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool, and in the reverse direction.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Worked with data governance and data quality teams to design various models and processes.
- Involved in all steps and the full scope of the project's data approach to MDM; created a data dictionary and source-to-target mappings in the MDM data model.
- Managed Azure Data Lake Storage (ADLS) and Data Lake Analytics with an understanding of how to integrate them with other Azure services; knowledge of U-SQL.
- Responsible for working with various teams to develop an analytics-based solution specifically targeting customer subscribers.
- Built a new CI pipeline with testing and deployment automation using Docker, Swamp, Jenkins, and Puppet; utilized continuous integration and automated deployments with Jenkins and Docker.
- Data visualization with Pentaho, Tableau, and D3; knowledge of numerical optimization, anomaly detection and estimation, A/B testing, statistics, and Maple; big data analysis using Hadoop, MapReduce, NoSQL, Pig/Hive, Spark/Shark, MLlib, Scala, NumPy, SciPy, pandas, and scikit-learn.
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python, and used the resulting engine to increase user lifetime by 45% and triple user conversations for target categories.
- Developed frameworks and processes to analyze unstructured information; assisted in Azure Power BI architecture design.
- Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and the MLlib libraries.
- Ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, and Load) tools and methodologies to collect data from various sources into a single data warehouse.
- Created and published multiple dashboards and reports using Tableau Server; worked on text analytics, Naive Bayes, sentiment analysis, and word clouds, retrieving data from Twitter and other social networking platforms.
- Worked on a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Tackled a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms.
- Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn (see the sketch after this list).
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
- Involved in unit testing the code and providing feedback to the developers; performed unit testing of the application using NUnit.
- Created and maintained SQL Server scheduled jobs executing stored procedures to extract data from Oracle into SQL Server; extensively used Tableau for customer marketing data visualization.
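A brief scikit-learn sketch of the tree-based feature selection approach mentioned above; it runs on synthetic imbalanced data standing in for the confidential fraud dataset, and all parameters are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import train_test_split

    # Synthetic, highly imbalanced stand-in for the real fraud features.
    X, y = make_classification(n_samples=5000, n_features=30, n_informative=8,
                               weights=[0.97, 0.03], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.25, random_state=42)

    # Rank features with a class-weighted random forest and keep the strongest.
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=200, class_weight="balanced",
                               random_state=42),
        threshold="median")
    X_train_sel = selector.fit_transform(X_train, y_train)
    X_test_sel = selector.transform(X_test)

    # Fit gradient boosting on the reduced feature set and check the holdout score.
    gbm = GradientBoostingClassifier(random_state=42)
    gbm.fit(X_train_sel, y_train)
    print("selected features:", selector.get_support().sum())
    print("holdout accuracy:", gbm.score(X_test_sel, y_test))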
Environment: Hadoop, Kafka, Spark, Sqoop, Docker, Swamp, Spark SQL, TDD, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper.
Data Engineer
Confidential - San Diego, CA
Responsibilities:
- Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
- Deployed the initial Azure components like Azure Virtual Networks, Azure Application Gateway, Azure Storage and Affinity groups.
- Developed data pipeline using Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Delivered denormalized data from the produced layer in the data lake to Power BI consumers for modeling and visualization.
- Wrote a Kafka REST API to collect events from the front end.
- Involved in creating an HDInsight cluster in the Microsoft Azure Portal; also created Event Hubs and Azure SQL databases.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Exposed transformed data in the Azure Databricks Spark platform in Parquet format for efficient data storage.
- Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework.
- Involved in running all the Hive scripts through Hive, Hive on Spark, and some through Spark SQL.
- Migrated data into the RV data pipeline using Databricks, Spark SQL, and Scala.
- Worked on product positioning and messaging that differentiate Hortonworks in the open source space.
- Experience designing and developing applications leveraging MongoDB.
- Involved in importing real-time data into Hadoop using Kafka and implemented Oozie jobs for daily imports.
- Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
- Involved in complete Big data flow of the application starting from data ingestion from upstream to HDFS, processing and analyzing the data in HDFS.
- Troubleshot Azure development, configuration, and performance issues.
- Interacted with multiple teams who are responsible for Azure Platform to fix the Azure Platform Bugs.
- Provided 24/7 on-call support for Azure configuration and performance issues.
- Created partitioned and bucketed Hive tables in Parquet format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables (see the sketch after this list).
- Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers.
- Used Jira for bug tracking and Bitbucket to check-in and checkout code changes.
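A short PySpark sketch of the partitioned Parquet-with-Snappy load mentioned above; the database, table, and column names are assumptions, and the bucketing clause used on the real tables is omitted here for brevity.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("avro-to-parquet")
             .enableHiveSupport()
             .getOrCreate())

    # Partitioned Parquet table with Snappy compression (illustrative schema).
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.orders_parquet (
            order_id    BIGINT,
            customer_id BIGINT,
            amount      DOUBLE,
            order_date  DATE
        )
        USING PARQUET
        OPTIONS ('compression'='snappy')
        PARTITIONED BY (order_date)
    """)

    # Load the Parquet table from an existing Avro-backed Hive table.
    spark.sql("""
        INSERT OVERWRITE TABLE analytics.orders_parquet
        SELECT order_id, customer_id, amount, order_date
        FROM analytics.orders_avro
    """)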
Environment: Scala, Azure, HDFS, Yarn, MapReduce, Hive, Sqoop, Flume, Oozie, Kafka, Impala, Spark SQL, Spark Streaming, Eclipse, Oracle, Teradata, PL/SQL, UNIX Shell Scripting.
Data Engineer
Confidential, Tysons, VA
Responsibilities:
- Experience creating and organizing HDFS over a staging area.
- Imported Legacy data from SQL Server and Teradata into Amazon S3
- As part of the data migration, wrote many SQL scripts to identify data mismatches and worked on loading the historical data from Teradata to Snowflake.
- Wrote Python code to manipulate and organize DataFrames so that all attributes in each field were formatted identically.
- Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
- Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
- Developed merge scripts to UPSERT data into Snowflake from an ETL source (see the sketch after this list).
- Utilized pandas to create DataFrames.
- Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
- Created .bashrc files and other XML configurations to automate the deployment of Hadoop VMs on AWS EMR.
- Developed a raw layer of external tables within S3 containing copied data from HDFS.
- Created a data service layer of internal tables in Hive for data manipulation and organization.
- Exported data into Snowflake by creating staging tables to load data from different files in Amazon S3.
- Compared data at the leaf level across various databases whenever data transformation or data loading took place, analyzing data quality after these loads to check for any data loss or corruption.
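A minimal sketch of the Snowflake MERGE (UPSERT) pattern referenced above, using the snowflake-connector-python client; the connection parameters, table names, and key columns are placeholder assumptions.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",
        warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
    )

    # UPSERT rows from a staging table into a dimension table (names illustrative).
    merge_sql = """
        MERGE INTO dim_provider AS tgt
        USING stg_provider AS src
            ON tgt.provider_id = src.provider_id
        WHEN MATCHED THEN UPDATE SET
            tgt.name    = src.name,
            tgt.address = src.address,
            tgt.phone   = src.phone
        WHEN NOT MATCHED THEN INSERT (provider_id, name, address, phone)
            VALUES (src.provider_id, src.name, src.address, src.phone)
    """

    try:
        cur = conn.cursor()
        cur.execute(merge_sql)
        print("rows merged:", cur.rowcount)
    finally:
        conn.close()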
Environment: HDFS, AWS, SSIS, Snowflake, Hadoop, Hive, HBase, MapReduce, Spark, Sqoop, Pandas, MySQL, SQL Server, PostgreSQL, Teradata, Java, Unix, Python, Tableau, Oozie, Git.
Data Analyst
Confidential
Responsibilities:
- Understand the data visualization requirements from the Business Users.
- Writing SQL queries to extract data from the Sales data marts as per the requirements.
- Developed Tableau data visualizations using scatter plots, geographic maps, pie charts, bar charts, and density charts.
- Designed and deployed rich graphic visualizations with drill-down and drop-down menu options and parameters using Tableau.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Explored traffic data from databases, connected it with transaction data, and presented and wrote reports for every campaign, providing suggestions for future promotions.
- Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis (see the sketch after this list).
- Data cleaning, merging, and exporting of the dataset were done in Tableau Prep.
- Carried out data processing and cleaning techniques to reduce text noise and dimensionality and improve the analysis.
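A small pandas sketch of the SQL-to-Excel/Python handoff described above; the connection string, query, and output path are illustrative placeholders.

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(
        "mssql+pyodbc://user:password@sales-server/SalesDM"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )

    # Pull a campaign-level extract from the sales data mart (query is illustrative).
    query = """
        SELECT campaign_id, region, SUM(revenue) AS revenue
        FROM fact_sales
        GROUP BY campaign_id, region
    """
    sales = pd.read_sql_query(query, engine)

    # Hand the extract to Excel for business users and keep the frame for
    # further analysis in Python.
    sales.to_excel("campaign_revenue.xlsx", index=False)
    print(sales.describe())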
Environment: Python, Informatica v9.x, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel.
Data Analyst
Confidential
Responsibilities:
- Processed data received from vendors and loaded it into the database; the process was carried out weekly, reports were delivered bi-weekly, and the extracted data was checked for integrity.
- Documented requirements and obtained signoffs.
- Coordinated between the Business users and development team in resolving issues.
- Documented data cleansing and data profiling.
- Wrote SQL scripts to meet the business requirement.
- Analyzed views and produced reports.
- Tested cleansed data for integrity and uniqueness.
- Automated the existing system to achieve faster and more accurate data loading.
- Generated weekly and bi-weekly reports for the client business team using Business Objects and documented them.
- Used Informatica transformations such as Source Qualifier, Aggregator, Joiner, Normalizer, Rank, Router, Lookup, Sorter, Reusable, and Transaction Control to parse complex files and load them into databases.
- Created complex SQL queries and scripts to extract, aggregate, and validate data from Oracle, MS SQL, and flat files using Informatica and loaded it into a single data warehouse repository for data analysis.
- Learned to create Business Process Models.
- Managed multiple projects simultaneously, tracking them against varying timelines through a combination of business and technical skills.
- Good understanding of clinical practice management, medical and laboratory billing, and insurance claim processing, with process flow diagrams.
- Assisted the QA team in creating test scenarios covering a day in the life of the patient for inpatient and ambulatory workflows.
Environment: SQL, data profiling, data loading, QA team, Tableau, Python, Informatica