
Sr Data Engineer Resume


St Louis, MO

SUMMARY

  • 8 years of experience in the IT industry on Big Data platforms, with extensive hands-on experience in the Apache Hadoop ecosystem and enterprise application development.
  • Experience with BIG DATA using the HADOOP framework and related technologies such as HDFS, HBASE, MapReduce, Spark, Storm, Scala, Kafka, HIVE, PIG, FLUME, OOZIE, POSTGRES, SQOOP, TALEND, IMPALA and ZOOKEEPER.
  • Experience in Big data ecosystems using Hadoop, Impala, Airflow, Snowflake, Teradata and Oozie.
  • Hands-on experience with AWS data analytics services such as Athena, Glue Data Catalog and QuickSight.
  • Experience working within the Agile, Scrum and Waterfall Methodologies.
  • Hands-on experience with data ingestion tools such as Kafka and Flume, and workflow management tools such as Oozie.
  • Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka and Flume.
  • Experience with the Apache Spark ecosystem using Spark Core, PySpark, SQL, DataFrames and RDDs, and knowledge of Spark MLlib.
  • Experience in developing ETL applications on large volumes of data using tools such as MapReduce, Spark-Scala, PySpark, Spark SQL and Pig.
  • Experience in developing Spark programs for batch and real-time processing, including Spark Streaming applications for real-time processing.
  • Experience in developing data pipelines that use Kafka to store data into HDFS (a sketch appears after this list).
  • Hands-on experience in writing MapReduce programs using Java to handle different data sets using Map and Reduce tasks.
  • Expertise in using various Hadoop ecosystem components such as MapReduce, Pig, Hive, Zookeeper, HBase, Sqoop, Oozie, Flume, Drill and Spark for data storage and analysis.
  • Experienced in identifying improvement areas for system stability and providing high-availability architectural solutions.
  • Experience in integrating various data sources such as Oracle SE2, SQL Server, flat files and unstructured files into a data warehouse.
  • Experience in working with the visualization tools like Tableau, PowerBI.
  • Experienced in developing and designing Web Services (SOAP and Restful Web services).
  • Experience in web-based UI development using jQuery UI, jQuery, Ext JS, CSS, HTML, XHTML and JavaScript.
  • Experience in Extraction, Transformation and Loading (ETL) of data from various sources into data warehouses, as well as data processing such as collecting, aggregating and moving data from various sources using Apache Flume, Kafka, Power BI and Microsoft SSIS.
  • Experience with Amazon AWS services such as EMR and EC2, which provide fast and efficient processing of Big Data.
  • Hands-on experience with AWS components such as VPC, EC2, EBS, Redshift and CFT.
  • Experience in Machine Learning algorithms and Predictive Modeling such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text, Avro and Parquet files.
  • Experience with various databases such as MySQL, SQL, DB2, Oracle, and NoSQL stores such as MongoDB, Cassandra, DynamoDB, Redshift and HBase.
  • Experience in tracking and logging end-to-end software application builds using Azure DevOps.
  • Experience in creating pipeline jobs and scheduling triggers using Azure Data Factory.
  • Experience with Google Cloud Platform (GCP) big data products: BigQuery, Cloud Dataproc, Google Cloud Storage and Composer (Airflow as a service).
  • Hands-on experience with Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
  • Experienced in managing Hadoop clusters and services using Cloudera Manager.
  • Experienced in running queries using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
  • Experience in developing Hadoop architecture on Windows and Linux platforms.
  • Experience working with GitHub/Git source and version control systems, and ANT, Maven and Jenkins as build tools.
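
The Kafka-to-HDFS pipeline work noted above can be illustrated with a minimal PySpark Structured Streaming sketch; the broker address, topic name and HDFS paths are placeholder assumptions (and the spark-sql-kafka package is assumed to be on the classpath), not details of any specific engagement.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Read the raw event stream from Kafka (broker and topic are placeholders).
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "events")
           .option("startingOffsets", "latest")
           .load())

    # Kafka delivers key/value as binary, so cast the payload to string.
    events = raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")

    # Land the stream on HDFS as Parquet, with a checkpoint for recovery.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/raw/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()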

TECHNICAL SKILLS

Hadoop Distributions: Cloudera, AWS EMR and Azure Data Factory.

Languages: Scala, Python, SQL, Hive QL, KSQL.

IDE Tools: Eclipse, IntelliJ, PyCharm.

Cloud platform: AWS, Azure

AWS Services: VPC, IAM, S3, Elastic Beanstalk, CloudFront, Redshift, Lambda, Kinesis, DynamoDB, Direct Connect, Storage Gateway, EKS, DMS, SMS, SNS, and SWF

Reporting and ETL Tools: Tableau, Power BI, Talend, AWS GLUE.

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)

Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Impala, Zookeeper, Flume, Airflow, Informatica, Snowflake, DataBricks, Kafka, Cloudera

Machine Learning and Statistics: Regression, Random Forest, Clustering, Time-Series Forecasting, Hypothesis Testing, Exploratory Data Analysis

Containerization: Docker, Kubernetes

CI/CD Tools: Jenkins, Bamboo, GitLab CI, uDeploy, Travis CI, Octopus

Operating Systems: UNIX, LINUX, Ubuntu, CentOS.

Other Software: Control M, Eclipse, PyCharm, Jupyter, Apache, Jira, Putty, Advanced Excel

Frameworks: Django, Flask, WebApp2

PROFESSIONAL EXPERIENCE

Confidential, St. Louis, MO

Sr Data Engineer

Responsibilities:

  • Worked on Big Data integration and analytics based on Hadoop, SOLR, Spark, Kafka, Storm and webMethods.
  • Utilized Agile Methodology to help manage and organize a team with regular code review sessions.
  • Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
  • Developed PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns (see the first sketch after this list).
  • Worked on creating Hive and HBase tables (using HBase integration) on the imported data based on the Line of Business (LOB).
  • Used Hive queries in Spark SQL for analyzing and processing the data.
  • Created and implemented various shell scripts for automating the jobs.
  • Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
  • Worked with NiFi to manage the flow of data from source systems through the data flow.
  • Used the DataFrame API in Scala to work with distributed collections of data organized into named columns, developing predictive analytics using Apache Spark Scala APIs.
  • Developed Scala scripts using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation, queries and writing data back into the OLTP system through Sqoop.
  • Implemented pipelines using PySpark along with Talend components.
  • Designed, developed and deployed the convergent mediation platform data collection and billing process using Talend ETL.
  • Solved performance issues in Hive and Pig scripts with an understanding of joins, groups and aggregations and how they translate to MapReduce jobs.
  • Developed streaming pipelines using Apache Spark with Python.
  • Assisted users in creating/modifying worksheets and data visualization dashboards in Tableau.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Automated the data flow using NiFi, Accumulo and Control-M.
  • Built a program with the Apache Beam Python SDK and executed it on Cloud Dataflow to stream Pub/Sub messages into BigQuery tables (see the second sketch after this list).
  • Developed mappings using Informatica to load data from sources such as Relational tables, Sequential files into the target system.
  • Brought data from various sources into Hadoop and Cassandra using Kafka.
  • Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
  • Queried and analyzed data from the NoSQL database Cassandra for quick searching, sorting and grouping through CQL.
  • Worked on analyzing the Hadoop cluster using different big data analytic tools including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark and Kafka.
  • Involved in creating Hive tables, loading them with data and writing Hive queries, which invoke and run MapReduce jobs in the backend.
  • Used ETL (SSIS) to develop jobs for extracting, cleaning, transforming and loading data into the data warehouse.
  • Experience in utilizing cloud-based technologies using Amazon Web Services (AWS): VPC, EC2, Route 53, S3, DynamoDB, ElastiCache, Glacier, CloudWatch, CloudFront, Kinesis, Redshift, SQS, SNS and RDS.
  • Implemented a proof of concept deploying this product in an AWS S3 bucket and Snowflake.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Created data ingestion modules using AWS Glue for loading data into various layers in S3 and reporting using Athena and QuickSight; knowledge of web/application servers such as Apache Tomcat and Oracle WebLogic.
  • Proficient in NiFi and workflow schedulers for managing Hadoop actions with control flows.
  • Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline whose output is written to the Glue Catalog and can be queried from Athena.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
  • Scheduled all jobs using Airflow scripts written in Python, adding tasks to DAGs and defining dependencies between the tasks.
  • Used Jenkins pipelines to drive all microservice builds out to the Docker registry and then deployed to Kubernetes; created and managed pods using Kubernetes.
  • Added further tasks to the Airflow DAGs and to AWS Lambda as scheduling requirements changed.
  • Performed data analysis and design; created and maintained large, complex logical and physical data models and metadata repositories using ERwin and MB MDR.
  • Developed ETLs using PySpark, using both the DataFrame API and the Spark SQL API.
  • Implemented test scripts to support test-driven development and continuous integration.
  • Worked with Oozie and Zookeeper to manage the flow of jobs and coordination in the cluster.
  • Wrote data in multiple file formats including XML, JSON and other compressed file formats.
  • Used Jenkins as a CI/CD tool for daily builds and deploys.
  • Transferred data between Pig scripts and Hive using HCatalog, and transferred relational database tables using Sqoop.
  • Used Git as the version control tool.
  • Used Jenkins scheduler to schedule the ETL workflows.
  • Developed JUnit test cases to test business components; understanding of all aspects of testing such as unit, regression, agile, white-box and black-box testing.
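
A minimal sketch of the column-level hashing approach referenced above, assuming a Parquet source and hypothetical column names; the production job's algorithms and paths may differ.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("column-hashing").getOrCreate()

    raw_df = spark.read.parquet("hdfs:///data/raw/customers")   # placeholder path
    sensitive_cols = ["ssn", "account_number"]                  # hypothetical client-specified columns

    # Replace each sensitive column with its SHA-256 digest so downstream
    # layers never see the clear value.
    hashed_df = raw_df
    for col_name in sensitive_cols:
        hashed_df = hashed_df.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

    hashed_df.write.mode("overwrite").parquet("hdfs:///data/secure/customers")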
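
A minimal sketch of the Pub/Sub-to-BigQuery streaming pattern built with the Apache Beam Python SDK and run on Cloud Dataflow; the project, topic, table and bucket names are placeholders, and the target table is assumed to already exist.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        streaming=True,
        runner="DataflowRunner",
        project="my-project",            # placeholder project
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
               topic="projects/my-project/topics/events")
         | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "my-project:analytics.events",
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))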

Environment: Python, Hadoop, Spark, ETL, HDFS, Hive, Pig, PySpark, HBase, Big Data, Tableau, Apache Storm, Oozie, Sqoop, Kafka, Flume, Zookeeper, MapReduce, SQL, NoSQL, Cassandra, Scala, Linux, MySQL Workbench, Java, Eclipse, Oracle, Git, Jenkins.

Confidential, Atlanta, GA

Sr. Data Engineer/Python Developer

Responsibilities:

  • Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
  • Designed and Configured Azure Cloud relational servers and databases analyzing current and future business requirements.
  • Implemented Composite Server for data virtualization needs and created multiple views for restricted data access using a REST API.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage and BigQuery.
  • Involved in importing data from MS SQL Server, MySQL and Teradata into HDFS using Sqoop.
  • Collected log data from web servers and integrated it into HDFS using Flume.
  • Involved in loading data into NoSQL databases such as HBase and MongoDB.
  • Developed scripts using PySpark to push the data from GCP to the third-party vendors using their API framework.
  • Ingested streaming data with Apache NiFi into Kafka.
  • Designed and developed Azure Data Factory (ADF) pipelines extensively for ingesting data from different relational and non-relational source systems to meet business functional requirements.
  • Worked on the backend using Scala and Spark to perform several aggregation logics.
  • Created and implemented a highly scalable and reliable distributed data design using the NoSQL database HBase.
  • Developed business intelligence solutions using SQL Server Data Tools and loaded data to SQL and Azure cloud databases.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
  • Used Apache Airflow in the GCP Composer environment to build data pipelines, using various Airflow operators such as the bash operator, Hadoop operators, and Python callable and branching operators (see the sketch after this list).
  • Ingested huge volume and variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory.
  • Designed and developed user defined functions, stored procedures, triggers for Cosmos DB.
  • Implemented the Azure self-hosted integration runtime in ADF.
  • Used SQL Azure extensively for database needs in various applications.
  • Developed ELT processes from Ab Initio files and Google Sheets in GCP, with compute provided by Dataprep, Dataproc (PySpark) and BigQuery.
  • Created a linked service to land data from an SFTP location into Azure Data Lake.
  • Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
  • Used MapReduce programs for data cleaning and transformations, and loaded the output into Hive tables in different file formats.
  • Designed and developed a new solution to process near-real-time (NRT) data using Azure Stream Analytics, Azure Event Hubs and Service Bus queues.
  • Loaded data on an incremental basis into the BigQuery raw layer using Google Dataproc, GCS buckets, Hive, Spark, Scala, Python, gsutil and shell scripts.
  • Created data maps in Informatica to extract data from Sequential files.
  • Worked with Presto, Hive, Spark SQL and BigQuery through Python client libraries, building interoperable and faster programs for analytics platforms.
  • Working knowledge of Kubernetes in GCP, creating new monitoring techniques using the Stackdriver log router and designing reports in Data Studio.
  • Developed Spark scripts using Python on Azure HDInsight for Data Aggregation, Validation and verified its performance over MR jobs.
  • Worked on extending Hive and Pig core functionality by writing custom UDFs using Java.
  • Developed Scala scripts using both Data frames/SQL/Data sets and RDD/MapReduce in Spark for Data Aggregation, queries and writing data back into OLTP system through Sqoop.
  • Worked with Spark for improving performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark-SQL, Data Frames and Pair RDD's.
  • Working experience with RDDs and DataFrames (Spark SQL) using PySpark for analyzing and processing data.
  • Converted SAS code to Python/Spark-based jobs on Cloud Dataproc and BigQuery in GCP.
  • Moved data between BigQuery and Azure Data Warehouse using ADF, and created cubes on AAS with complex DAX for memory optimization in reporting.
  • Used Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific Consumer Group in Kafka.
  • Developed shell scripts to generate Hive CREATE statements from the data and load data into the tables.
  • Involved in writing custom Map-Reduce programs using java API for data processing.
  • Created Hive tables as per requirements, as internal or external tables defined with appropriate static or dynamic partitions and bucketing for efficiency.
  • Supported MapReduce programs running on the cluster and was involved in loading data from the UNIX file system into HDFS.
  • Worked with the build tools like Maven and the version control tools like Git.
  • Implemented test scripts to support test-driven development and continuous integration.
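
A minimal sketch of the Composer/Airflow DAG pattern described above, mixing a bash operator, Python callables and a branching operator; the task logic, IDs and schedule are illustrative assumptions (Airflow 2.x import paths).

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import BranchPythonOperator, PythonOperator


    def choose_load_path():
        # Hypothetical check that decides between a full and an incremental load.
        return "load_incremental"


    with DAG(
        dag_id="daily_ingest",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        branch = BranchPythonOperator(task_id="branch", python_callable=choose_load_path)
        load_full = PythonOperator(task_id="load_full",
                                   python_callable=lambda: print("full load"))
        load_incremental = PythonOperator(task_id="load_incremental",
                                          python_callable=lambda: print("incremental load"))

        # Branching: only one of the two load tasks runs per execution.
        extract >> branch >> [load_full, load_incremental]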

Environment: Azure, GCP, Python, Hadoop, Spark, ETL, PySpark, HDFS, Hive, Pig, HBase, Big Data, Oozie, Sqoop, Kafka, Flume, Zookeeper, MapReduce, MongoDB, Scala, Linux, NoSQL, MySQL Workbench, Java, Eclipse, Oracle, Git, Maven.

Confidential, Monroe, LA

Software Engineer

Responsibilities:

  • Working experience with Agile and SCRUM methodologies.
  • Used the YARN architecture and MapReduce in the development cluster for a POC.
  • Used REST services for handling unfinished jobs, checking job status and creating datasets via a URL.
  • Used Kafka capabilities such as distribution, partitioning and the replicated commit log service for messaging systems by maintaining feeds, and created Kafka applications that monitor consumer lag within Apache Kafka clusters.
  • Worked in building ETL pipeline for data ingestion, data transformation, data validation on cloud service AWS, working along with data steward under data compliance.
  • Provided cluster coordination services through Zookeeper.
  • Integrated Cassandra as a distributed persistent metadata store to provide metadata resolution for network entities on the network.
  • Worked with SQL Server Reporting Services (SSRS); created and formatted crosstab, conditional, drill-down, top-N, summary, form, OLAP, sub-reports, ad-hoc, parameterized, interactive and custom reports.
  • Created complex ETL packages using SSIS to extract data from staging tables to partitioned tables with incremental loads.
  • Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework.
  • Developed and deployed Apache NiFi flows across various environments and optimized data flows.
  • Analyzed the web log data using the HiveQL.
  • Defined job workflows as per their dependencies in Oozie.
  • Played a key role in productionizing the application after testing by BI analysts.
  • Delivered a POC of Flume to handle real-time log processing for attribution reports.
  • Maintained system integrity of all sub-components related to Hadoop.
  • Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
  • Used ETL to implement slowly changing dimension transformations to maintain historical data in the data warehouse.
  • Developed a Tableau report that tracks the dashboards published to Tableau Server, which helps identify potential future clients in the organization.
  • Involved in integrations among Pig, Hive and HBase.
  • Designed the ETL architecture and developed a capacity-planning and trending data warehousing solution to load data from multiple sources into the data warehouse.
  • Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python.
  • Carried out data transformation and cleansing using SQL queries, Python and PySpark.
  • Responsible for analyzing and data cleansing using Spark SQL queries.
  • Used Hive queries in Spark SQL for analyzing and processing the data.
  • Designed and developed an ETL process in AWS Glue to migrate usage data from the S3 data source to Redshift (see the sketch after this list).
  • Operated the cluster on AWS using EC2, VPC, RDS, EMR, S3 and CloudWatch.
  • Implemented a generalized solution model using AWS SageMaker.
  • Created monitors, alarms and notifications for EC2 hosts using CloudWatch, Athena, AWS Glue, CloudTrail, Docker and SNS.
  • Wrote Spark programs to model data for extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed file formats.
  • Strong understanding of AWS components such as EC2 and S3.
  • Worked on AWS Elastic Beanstalk for fast deploying of various applications developed with Java, PHP, Node.js, Python on familiar servers such as Apache.
  • Converted previously written SAS programs into Python for one of the ETL projects.
  • Developed ETL jobs using Spark-Scala to migrate data from Oracle to new Cassandra tables.
  • Worked with PL/SQL procedures and used them in Stored Procedure Transformations.
  • Compiled data from various sources to perform complex analysis for actionable results.
  • Implemented a CI/CD pipeline with GitHub and AWS.
  • Created action filters, parameters and calculated sets for preparing dashboards and worksheets using Power BI.
  • Built Maven scripts that compile the code, pre-compile the JSPs, build an EAR file and deploy the application on the WebSphere application server.
  • Installed and configured Hive, wrote Hive UDFs and used MapReduce and JUnit for unit testing.
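
A minimal sketch of the Glue job pattern referenced above for moving cataloged S3 usage data into Redshift; the catalog database, table, connection name and temp bucket are placeholder assumptions.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source: usage data crawled from S3 into the Glue Data Catalog (placeholder names).
    usage = glue_context.create_dynamic_frame.from_catalog(
        database="usage_db", table_name="s3_usage_raw")

    # Target: a Redshift table reached through a cataloged JDBC connection,
    # staged through a temporary S3 directory.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=usage,
        catalog_connection="redshift-conn",
        connection_options={"dbtable": "analytics.usage", "database": "dw"},
        redshift_tmp_dir="s3://my-bucket/temp/")

    job.commit()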

Environment: Python, AWS, Hadoop, ETL, MapReduce, Spark, PySpark, HDFS, Hive, Pig, HBase, Big Data, Oozie, Sqoop, Scala, Kafka, Flume, Zookeeper, Spark SQL, Cassandra, Tableau, Unix, REST, Java, Maven, Git.

Confidential

Python Developer

Responsibilities:

  • Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and Agile methodologies.
  • Implemented Proof of concepts for SOAP & REST APIs.
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python.
  • Devised simple and complex SQL scripts to check and validate data flows in various applications.
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
  • Involved in dimensional modeling (star schema, snowflake schema), transactional modeling and SCD (slowly changing dimensions).
  • Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS using Scala.
  • Worked on implementing Spark using Scala and Spark SQL for faster analyzing and processing of data.
  • Involved in maintaining and administering the Hadoop cluster using Cloudera.
  • Strong understanding of AWS components such as EC2, Cloud Watch, S3 and AWS Glue.
  • Created data pipelines for different events to load the data from DynamoDB to AWS S3 bucket and then into HDFS location.
  • Implemented Spark using Python and Spark SQL for faster testing and processing of data.
  • Worked on the backend using Python and Spark to perform several aggregation logics.
  • Prepared the ETL specifications, Mapping document, ETL framework specification.
  • Stored WiFi data arriving through EMS/JMS in the Hadoop ecosystem and developed Hive and Pig scripts along with UDFs to achieve the required functionality on AWS EMR.
  • Worked extensively with Oracle and SQL Server; wrote complex SQL queries against the ERP system for data analysis purposes.
  • Implemented end-to-end systems for Data Analytics, Data Automation and integrated with custom visualization tools using R, Hadoop and MongoDB.
  • Developed near-real-time data pipelines using Spark.
  • Extensively worked on Informatica PowerCenter.
  • Developed Python programs that run end-to-end data migration as well as transformation, loading data into sinks such as Oracle and MySQL.
  • Developed simple/complex MapReduce jobs using Hive and Pig.
  • Transported data to HBase using Flume.
  • Worked with deployments from Dev to UAT, and then to Prod.
  • Involved in understanding the Requirements of the End Users/Business Analysts and Developed Strategies for ETL processes.
  • Enabled concurrent access to Hive tables with shared/exclusive locks by implementing Zookeeper in the cluster.
  • Developed Pig scripts and UDFs to manipulate and transform the loaded data.
  • Responsible for data services and data movement infrastructure; good experience with ETL concepts, building ETL solutions and data modeling.
  • Worked with Reporting developers to oversee the implementation of reports/dashboard designs in Tableau.
  • Automated a nightly build to run quality control using Python with the Boto3 library, ensuring the pipeline does not fail and reducing manual effort (see the sketch after this list).
  • Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines.
  • Worked with Confluence and Jira; skilled in data visualization libraries such as Matplotlib and Seaborn.
  • Involved with development of Ansible playbooks with Python and SSH as wrapper for management of AWS node configurations and testing playbooks on AWS instances.
  • Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
  • Used CI/CD tools such as Jenkins, Git Stash and Ansible for daily builds and deploys.
  • Created Ad hoc Oracle data reports for presenting and discussing the data issues with Business.
  • Worked on version control tools like Git and build tools like Maven.
  • Developed test cases using JUnit and Mockito for unit testing of all developed modules.
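
A minimal sketch of the Boto3-based nightly quality check mentioned above: it verifies that the previous day's partition landed in S3 before downstream jobs run. The bucket name and prefix layout are hypothetical.

    import sys
    from datetime import datetime, timedelta

    import boto3

    BUCKET = "my-data-lake"                           # placeholder bucket
    PREFIX_TEMPLATE = "curated/events/dt={date}/"     # placeholder partition layout


    def partition_has_data(bucket: str, prefix: str) -> bool:
        s3 = boto3.client("s3")
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
        return resp.get("KeyCount", 0) > 0


    if __name__ == "__main__":
        run_date = (datetime.utcnow() - timedelta(days=1)).strftime("%Y-%m-%d")
        prefix = PREFIX_TEMPLATE.format(date=run_date)
        if not partition_has_data(BUCKET, prefix):
            # A non-zero exit makes the scheduler mark the nightly run as failed.
            sys.exit(f"No data found under s3://{BUCKET}/{prefix}")
        print(f"Quality check passed for {prefix}")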

Environment: Python, AWS, ETL, Hadoop, Spark, PySpark, Tableau, HDFS, Hive, Pig, HBase, Kafka, Big Data, Oozie, Sqoop, Zookeeper, MapReduce, Scala, SQL, MongoDB, Linux, NoSQL, MySQL, Java.

Confidential

Software Analyst

Responsibilities:

  • Gathered business requirements and prepared technical design documents, target to source mapping document, mapping specification document.
  • Worked on Hadoop (Big Data) cluster administration using Cloudera Manager, including deployment and management of its ecosystem (Pig, Hive, HBase, MapReduce, Zookeeper, etc.).
  • Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
  • Coordinated design reviews, ETL code reviews with teammates.
  • Involved in developing web services using SOAP for sending data to and getting data from an external interface; worked on views, stored procedures, triggers and SQL queries for loading the data (staging) to enhance and maintain the existing functionality.
  • Enhanced and upgraded the documentation related to Kafka.
  • Handled logical implementation of and interaction with HBase.
  • Efficiently put and fetched data to/from HBase by writing MapReduce jobs.
  • Developed MapReduce jobs to automate the transfer of data to and from HBase.
  • Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.
  • Created AWS Lambda functions and provisioned EC2 instances in the AWS environment, implemented security groups and administered Amazon VPCs.
  • Worked on AWS CLI, Auto Scaling and CloudWatch monitoring creation and updates.
  • Allotted permissions, policies and roles to users and groups using AWS Identity and Access Management (IAM).
  • Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR and S3.
  • Developed Python scripts to create data files from the database and post them to an FTP server on a daily basis using Windows Task Scheduler (see the sketch after this list).
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Developed analytical components using Scala, Spark, Apache Mesos and Spark Stream.
  • Migration of ETL processes from RDBMS to Hive to test the easy data manipulation.
  • Wrote PL/SQL queries and Stored procedures for data retrieval.
  • Developed ETL parsing and analytics using Python/Spark to build a structured data model in Elasticsearch for consumption by the API and UI.
  • Developed ANT script for building and packaging J2EE components.
  • Worked on version control tools like SVN and Build tools like Maven.
  • Wrote UNIX shell scripts to support and automate the ETL process.
  • Extensively worked on UNIX Shell Scripting for splitting group of files to various small files and file transfer automation.
  • Involved in deploying and executing the code in Oracle, and also helped in testing.
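
A minimal sketch of the daily export pattern described above: pull rows from a database, write a CSV file and upload it to an FTP drop location. The query, database driver, host and credentials are placeholders; scheduling is left to Windows Task Scheduler.

    import csv
    import sqlite3                      # stand-in for the production database driver
    from datetime import date
    from ftplib import FTP

    EXPORT_QUERY = "SELECT id, amount, created_at FROM daily_sales"   # placeholder query
    FILENAME = f"daily_sales_{date.today():%Y%m%d}.csv"

    # 1. Extract rows from the database into a local CSV file.
    conn = sqlite3.connect("reporting.db")
    rows = conn.execute(EXPORT_QUERY).fetchall()
    with open(FILENAME, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "amount", "created_at"])
        writer.writerows(rows)
    conn.close()

    # 2. Upload the file to the FTP drop location (host/credentials are placeholders).
    with FTP("ftp.example.com") as ftp:
        ftp.login(user="report_user", passwd="change-me")
        with open(FILENAME, "rb") as f:
            ftp.storbinary(f"STOR {FILENAME}", f)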

Environment: Hadoop, ETL, Python, Hive, HBase, Pig, Scala, Spark, PySpark, Apache, MapReduce, Cloudera, SOAP, Agile, Oracle, Unix, SQL, Kafka, PL/SQL, JUnit, ANT, Maven, SVN.
