Senior Data Engineer Resume
Dallas, Texas
SUMMARY
- Overall 9+ years of professional IT experience with Big Data and the Hadoop framework, spanning analysis, design, development, documentation, deployment, and integration using SQL and Big Data technologies; experienced with the Hadoop ecosystem, including installation and administration of UNIX/Linux servers and configuration of Hadoop ecosystem components in existing clusters.
- Experience implementing Big Data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, data quality, and data virtualization solutions.
- Proven track record as a Data Engineer working with Amazon cloud services, Google Cloud Platform, Snowflake, Big Data/Hadoop applications, and product development.
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Experience with Teradata tools and utilities (BTEQ, FastLoad, MultiLoad, FastExport).
- Well versed with Big Data on AWS cloud services, i.e., EC2, S3, Glue, Athena, DynamoDB, and Redshift.
- Experience in job/workflow scheduling and monitoring tools like Oozie, AWS Data Pipeline, and Autosys.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Experience working on creating and running Docker images with multiple microservices.
- Good experience in deploying, managing, and developing with MongoDB clusters.
- Docker container orchestration using ECS, ALB, and Lambda.
- Experience with Unix/Linux systems with scripting experience and building data pipelines.
- Responsible for migrating applications running on premises onto the Azure cloud.
- Experience on Cloud Databases and Data warehouses (SQL Azure and Confidential Redshift/RDS)
- Proficiency in multiple databases like MongoDB, Cassandra, MySQL, ORACLE, and MS SQL Server.
- Proficient in batch processing of data via MongoDB and Solr, and stream processing of data via the Storm API and Java.
- Played a key role in migrating Cassandra and Hadoop clusters to AWS and defined different read/write strategies.
- Strong SQL development skills including writing Stored Procedures, Triggers, Views, and User Defined functions.
- Expertise in using Linear & Logistic Regression and Classification Modeling, Decision Trees, Principal Component Analysis (PCA), Cluster and Segmentation analyses, and have authored and co-authored several scholarly articles applying these techniques.
- Experienced in Machine Learning classification algorithms like Logistic Regression, K-NN, SVM, Kernel SVM, Naive Bayes, Decision Tree & Random Forest classification.
- Hands-on experience in Azure development; worked on Azure web applications, App Services, Azure Storage, Azure SQL Database, Virtual Machines, Fabric Controller, Azure AD, Azure Search, and Notification Hubs.
TECHNICAL SKILLS
Modeling Tools: IBM Infosphere, SQL Power Architect, Oracle Designer, Erwin 9.6/9.5, ER/Studio 9.7, Sybase Power Designer.
Database Tools: Oracle 12c/11g, MS Access, Microsoft SQL Server 2014/2012, Teradata 15/14, PostgreSQL, Netezza.
Big Data Technologies: Hadoop, HDFS 2, Hive, Pig, HBase, Sqoop, Flume.
Cloud Platform: AWS, EC2, S3, SQS, Azure, GCP, Snowflake.
Operating System: Windows, DOS, Unix, Linux.
BI Tools: SSIS, SSRS, SSAS.
Reporting Tools: Business Objects, Crystal Reports.
Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant.
ETL Tools: Pentaho, Informatica Power 9.6, SAP Business Objects XIR3.1/XIR2, Web Intelligence.
Other Tools: TOAD, SQL*Plus, SQL*Loader, MS Project, MS Visio, and MS Office; have also worked with C++, UNIX, PL/SQL, etc.
PROFESSIONAL EXPERIENCE
Senior Data Engineer
Confidential, Dallas, Texas
Responsibilities:
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Designed and Implemented Big Data Analytics architecture, transferring data from Oracle.
- Created DDLs for tables and executed them to create tables in the warehouse for ETL data loads.
- Analyzed the requirements and framed the business logic for the ETL process using Talend.
- Developed jobs in Talend Enterprise Edition covering stage, source, intermediate, conversion, and target layers.
- Designed, developed, tested, and maintained Tableau functional reports based on user requirements.
- Converted all existing flat-file reports into Tableau visualizations.
- Assisted users in creating and modifying worksheets and data visualization dashboards.
- Generated various dashboards in Tableau Desktop using sources such as SQL Server, Teradata, Oracle, Excel, and text data.
- Developed data pipelines using Flume, Sqoop, Pig, Java MapReduce, and Spark to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Optimized Hive queries using best practices and appropriate parameters, leveraging Hadoop, YARN, Python, and PySpark.
- Worked on reading and writing multiple data formats like JSON, ORC, and Parquet on HDFS using PySpark (see the PySpark format-handling sketch at the end of this role).
- Encoded and decoded JSON objects using PySpark to create and modify data frames in Apache Spark.
- Set up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
- Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow (see the Airflow DAG sketch at the end of this role).
- Performed review and analysis of the detailed system specifications related to the DataStage ETL and related applications to ensure they appropriately address the business requirements.
- Evaluated impact of proposed changes on existing DataStage ETL applications, processes, and configurations.
- Worked with DataStage Designer for importing metadata from repository, new job categories and creating new data elements.
- Developed Autosys jobs for scheduling and running the DataStage jobs in Production Environment.
- Provisioned the highly available EC2 Instances using Terraform and Ansible Playbooks.
- Managed AWS infrastructure as code using Terraform.
- Responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark.
- Worked on designing, building, deploying, and maintaining MongoDB.
- Implemented data collection and transformation on the AWS cloud computing platform using S3, Athena, Glue, Redshift, PostgreSQL, and QuickSight.
- Designed and developed insights reports on AWS QuickSight for client deliverables.
- Developed BI data lake POCs using AWS services including Athena, S3, EC2, Glue, and QuickSight.
- Assisted in designing & developing data lake and ETL using python and Hadoop ecosystem.
- Involved in designing the Data pipeline from end-to-end, to ingest data into the Data Lake.
- Designed dimensional models, data lake architecture, and Data Vault 2.0 on Snowflake, and used the Snowflake logical data warehouse for compute.
- Developed a Python-based API (RESTful web service) using Flask; involved in the analysis, design, development, and production phases of the application.
- Developed Python batch processors to consume and produce various feeds.
- Designed and built multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Amazon Redshift, handling millions of records every day.
- Created automated pipelines in AWS CodePipeline to deploy Docker containers in AWS ECS using services like CloudFormation, CodeBuild, CodeDeploy, S3, and Puppet.
- Involved in data migration to Snowflake using AWS S3 buckets.
- Converted SQL Server mapping logic to SnowSQL queries.
- Built different visualizations and reports in Tableau using Snowflake data.
- Wrote different SnowSQL queries to interact with the compute layer and retrieve data from the storage layer.
- Responsible for maintaining and tuning existing cubes using SSAS and Power BI.
- Worked on cloud deployments using maven, docker and Jenkins.
- Created monitors, alarms, notifications and logs for Lambda functions, Glue Jobs, EC2 hosts using CloudWatch.
- Used AWS Glue for data transformation, validation, and cleansing.
- Used Python Boto3 to configure AWS services such as Glue, EC2, and S3 (see the Boto3 sketch at the end of this role).
Environment: Erwin 9.6, Oracle 12c, MDM, MS Office, SQL, SQL Loader, PL/SQL, DB2, SharePoint, Talend, Redshift, SQL Server, Hadoop, Spark, AWS.
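A minimal PySpark sketch of the JSON/ORC/Parquet handling referenced above. Paths, the payload schema, and column names are hypothetical placeholders, not project code.

```python
# Hypothetical example: read raw JSON from HDFS, decode a nested JSON string
# column into typed fields, and persist the result as ORC and Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("format-conversion").getOrCreate()

# Raw customer events landed on HDFS as JSON
events = spark.read.json("hdfs:///data/raw/events/")

# Decode the embedded JSON payload into typed columns
payload_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("action", StringType()),
])
decoded = events.withColumn("payload", from_json(col("payload_json"), payload_schema))

# Write columnar copies for downstream analysis, then read one back
decoded.write.mode("overwrite").orc("hdfs:///data/curated/events_orc/")
decoded.write.mode("overwrite").parquet("hdfs:///data/curated/events_parquet/")
curated = spark.read.parquet("hdfs:///data/curated/events_parquet/")
```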
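A hedged sketch of the kind of Airflow DAG described above for loading staged S3 files into Snowflake. The connection ID, stage, table, and schedule are assumptions; the operator comes from the apache-airflow-providers-snowflake package.

```python
# Hypothetical daily DAG: COPY files from an external S3 stage into a Snowflake table.
from datetime import datetime
from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="s3_to_snowflake_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # COPY INTO pulls the day's files from an external stage that points at S3
    load_orders = SnowflakeOperator(
        task_id="load_orders",
        snowflake_conn_id="snowflake_default",  # assumed connection name
        sql="""
            COPY INTO analytics.orders
            FROM @analytics.s3_orders_stage
            FILE_FORMAT = (TYPE = 'JSON')
            ON_ERROR = 'CONTINUE';
        """,
    )
```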
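An illustrative Boto3 snippet in the spirit of the Glue/EC2/S3 configuration work above; the job, bucket, and region names are hypothetical.

```python
# Hypothetical example: start a Glue job, inspect its input files on S3,
# and check the state of EC2 instances, all via boto3 clients.
import boto3

glue = boto3.client("glue", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

# Kick off a Glue ETL job with a runtime argument
run = glue.start_job_run(
    JobName="daily-orders-transform",
    Arguments={"--load_date": "2023-01-01"},
)
print("Glue run id:", run["JobRunId"])

# List the S3 objects the job will read
listing = s3.list_objects_v2(Bucket="raw-orders-bucket", Prefix="2023/01/01/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Check the state of the ETL worker instances
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])
```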
Confidential, NY
Data Specialist
Responsibilities:
- Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines.
- Designed several DAGs (Directed Acyclic Graph) for automating ETL pipelines.
- Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management.
- Set up GitLab repositories and Runners for build automation.
- Managed automation code versions using GitLab.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Worked with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Moved data between GCP and Azure using Azure Data Factory.
- Used the Cloud Shell SDK in GCP to configure Dataproc, Cloud Storage, and BigQuery services.
- Developed and deployed outcomes using Spark and Scala code on a Hadoop cluster running on GCP.
- Migrated an entire Oracle database to BigQuery and used Power BI for reporting.
- Built a program with Python and Apache Beam and executed it in Cloud Dataflow to run data validation between raw source files and BigQuery tables.
- Created Python scripts to ingest data from on-premises systems to GCS and built data pipelines using Apache Beam and Dataflow for data transformation from GCS to BigQuery (see the Beam sketch at the end of this role).
- Developed ELT jobs using Apache Beam to load data into BigQuery tables.
- Developed PySpark programs, created data frames, and worked on transformations.
- Transformed and analyzed data using PySpark and Hive based on ETL mappings.
- Developed PySpark scripts that read MSSQL tables and push the data into Big Data storage as Hive tables (see the JDBC-to-Hive sketch at the end of this role).
- Set up and configured a Kafka environment on Windows from scratch and monitored it.
- Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Implemented a one-time data migration of multistate-level data from SQL Server to Snowflake using Python.
- Worked on Confluence and Jira.
- Created ETL between data warehouses such as Snowflake and Redshift via Alteryx workflows.
- Staged API and Kafka data (in JSON file format) into Snowflake DB by flattening it for different functional services.
Environment: AWS, GCP, BigQuery, GCS Bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Cloud SQL, MySQL, PostgreSQL, SQL Server, Python, Scala, Spark, Hive, Spark SQL
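A hedged Apache Beam sketch of the GCS-to-BigQuery pipelines described above, intended to run on Cloud Dataflow. The project, bucket, and table names are placeholders.

```python
# Hypothetical pipeline: read newline-delimited JSON from GCS, parse each record,
# and append the rows to an existing BigQuery table.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-temp-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://raw-landing-bucket/orders/*.json")
        | "ParseJSON" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-gcp-project:analytics.orders",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```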
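An illustrative PySpark sketch of the MSSQL-to-Hive loads noted above. The JDBC URL, credentials, and table names are assumptions.

```python
# Hypothetical job: read a SQL Server table over JDBC and land it in a
# date-partitioned Hive table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date

spark = (
    SparkSession.builder
    .appName("mssql-to-hive")
    .enableHiveSupport()
    .getOrCreate()
)

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://mssql-host:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "********")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

# Tag each load with the run date and append to the Hive staging table
(
    orders.withColumn("load_date", current_date())
    .write.mode("append")
    .partitionBy("load_date")
    .saveAsTable("staging.orders")
)
```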
Confidential, Cleveland, Ohio
Data Engineer
Responsibilities:
- Designed and built multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Confidential Redshift, handling millions of records every day.
- Worked on Big Data on AWS cloud services such as EC2, S3, EMR, and DynamoDB.
- Managed security groups on AWS, focusing on high availability, fault tolerance, and auto-scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
- Developed SSRS reports and SSIS packages to extract, transform, and load data from various source systems.
- Implemented and managed ETL solutions and automated operational processes.
- Created DWHs, databases, schemas, and tables, and wrote SQL queries against Snowflake.
- Validated the data feed from the source systems to Snowflake DW cloud platform.
- Integrated and automated data workloads to Snowflake Warehouse.
- Ensured that ETL/ELT jobs succeeded and loaded data successfully into the Snowflake DB.
- Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
- Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
- Involved in forward engineering of the logical models to generate physical models and data models using Erwin, and in their subsequent deployment to the Enterprise Data Warehouse.
- Wrote various data normalization jobs for new data ingested into Redshift (see the Redshift COPY sketch at the end of this role).
- Created various complex SSIS/ETL packages to Extract, Transform and Load data
- Advanced knowledge of Confidential Redshift and MPP database concepts.
- Migrated on premise database structure to Confidential Redshift data warehouse
- Was responsible for ETL and data validation using SQL Server Integration Services.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries, allowing for a more reliable and faster reporting interface with sub-second response for basic queries.
- Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
- Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
- Implemented Kafka custom encoders for custom input formats to load data into Kafka partitions.
Environment: SQL Server, Erwin, Oracle, Redshift, Informatica, RDS, NoSQL, Snowflake Schema, MySQL, PostgreSQL, AWS, GCP, Azure
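A hedged sketch of a Redshift ingestion and normalization step like those described above: COPY staged S3 files into a staging table, then publish a cleaned, deduplicated version. Connection details, the IAM role, and table names are assumptions.

```python
# Hypothetical example using psycopg2 against an Amazon Redshift cluster.
import psycopg2

conn = psycopg2.connect(
    host="redshift-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="********",
)

with conn, conn.cursor() as cur:
    # Bulk-load the day's JSON files from S3 into a staging table
    cur.execute("""
        COPY staging.orders
        FROM 's3://raw-orders-bucket/2023/01/01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS JSON 'auto';
    """)
    # Normalize: trim text fields and deduplicate before publishing
    cur.execute("""
        INSERT INTO public.orders
        SELECT DISTINCT order_id, TRIM(customer_name), order_ts
        FROM staging.orders;
    """)

conn.close()
```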
Confidential
Big Data Developer
Responsibilities:
- Developed data pipelines using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
- Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, Zookeeper and Sqoop.
- Worked with NoSQL databases like HBase, Cassandra, DynamoDB (AWS), and MongoDB.
- Involved in migrating a Java test framework to Python Flask.
- Responsible for developing data pipelines using HDInsight, Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
- Worked with Ranger to enable metadata management, governance, and auditing.
- Installed and configured Hadoop, MapReduce, and HDFS (Hadoop Distributed File System); developed multiple MapReduce jobs in Java for data cleaning (see the streaming-style cleaning sketch at the end of this role).
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Involved in debugging and tuning the PL/SQL code, tuning queries, optimization for the Oracle, Teradata, and DB2 database.
- Experience in methodologies such as Agile, Scrum and Test-driven development.
- Installed, configured, and administered a small Hadoop cluster consisting of 10 nodes. Monitored cluster for performance and networking and data integrity issues.
- Responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
- Architected and designed a 30-node Hadoop innovation cluster with Sqrrl, Spark, Puppet, and HDP.
- Worked with data delivery teams to set up new Hadoop users, including setting up Linux users, creating Kerberos principals, and testing HDFS and Hive access.
- Loaded log data into HDFS using Flume and Kafka and performed ETL integrations.
Environment: Hive, Pig, HBase, Zookeeper, Sqoop, ETL, Ambari, Linux CentOS, MongoDB, Cassandra, Ganglia, and Cloudera Manager.
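Since the sketches in this resume use Python, the following is a streaming-style illustration of the kind of data-cleaning logic those MapReduce jobs performed (the actual jobs were written in Java). The record layout and field positions are assumptions; a script like this would be submitted as the mapper via the Hadoop Streaming jar.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: drop malformed log records and emit
# cleaned, tab-separated output for downstream jobs.
import sys

EXPECTED_FIELDS = 5  # assumed layout: user_id, event, timestamp, ip, status

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    # Skip records with missing columns or an empty user id
    if len(fields) != EXPECTED_FIELDS or not fields[0].strip():
        continue
    user_id, event, timestamp, ip, status = (f.strip() for f in fields)
    print("\t".join([user_id, event, timestamp, ip, status]))
```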
Confidential
Linux Administrator
Responsibilities:
- Installation, configuration, and maintenance of Solaris 9/10/11, Red Hat Linux 4/5/6, SUSE 10.3/11.1, HP-UX 11.x, and IBM AIX operating systems.
- Worked primarily on RHEL 4/5, HP-UX, and Solaris operating systems.
- Performed text processing and network programming with Perl scripting.
- Worked with installation, configuration, tuning, security, backup, recovery, and upgrades of UNIX and Linux (Red Hat, Ubuntu, and SuSE).
- Created and cloned Linux virtual machines and templates using VMware Virtual Client 3.5 and migrated servers between ESX hosts.
- Involved in migration activities using Redhat LVM, Solaris LVM, Veritas, and EMC Open Migrator.
- Installed OAS (Oracle Application Server) on Solaris 9 and configured it with the Oracle database.
- Wrote shell and Perl scripts for job automation.
- Tuned kernel parameters based on application/database requirements.
- Used Veritas File system (VxFS) and Veritas Volume Manager (VxVM) to configure RAID 1 and RAID 5 Storage Systems on Sun Solaris.
- File system tuning, growing, and shrinking with Veritas File system 3.5/ 4.x.
- Installed and configured GFS cluster for holding databases.
- Configured OpenLDAP on Red Hat Linux systems.
- Setup optimal RAID levels (fault tolerance) for protected data storage in NAS environments.
- Installed and configured DHCP, DNS (BIND, MS), web (Apache, IIS), mail (SMTP, IMAP, and POP3), and file servers.
- Maintaining Remedy environments used for the ticketing system.
- Created new slices, mounted new file systems, and unmounted file systems.
- Expertise in troubleshooting the systems and managing LDAP, DNS, DHCP and NIS.
Environment: Red Hat Linux (RHEL 3/4/5), Solaris 10, Logical Volume Manager, Sun & Veritas Cluster Server, VMWare, Global File System, Redhat Cluster Servers.