Data Engineer Resume
Dania Beach, FL
SUMMARY
- Around 8 years of professional experience involving project development, implementation, deployment, and maintenance.
- Proficient in the Software Development Life Cycle (SDLC), project management methodologies, and Microsoft SQL Server database management.
- Experience in Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytical solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, HBase, and Zookeeper.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and optimizing HiveQL queries (see the sketch after this list).
- Strong Hadoop and platform support experience with major Hadoop distributions such as Cloudera, Hortonworks, Amazon EMR, and Azure HDInsight.
- In-depth understanding of Hadoop architecture and its components, such as the Resource Manager, Application Master, NameNode, DataNode, and HBase design principles.
- Expertise in writing Hadoop Jobs to analyze data using MapReduce, Hive, Pig, and Splunk.
- Hands-on experience working with ecosystem components such as Hive, Sqoop, Spark, MapReduce, Flume, and Oozie.
- Worked on data processing, transformations, and actions in Spark using Python (PySpark).
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Strong experience working with databases such as Oracle 10g, DB2, SQL Server 2008, and MySQL, and proficiency in writing complex SQL queries.
- Excellent SQL programming skills; developed stored procedures, triggers, functions, and packages using PL/SQL.
- Experience with NoSQL databases such as HBase, Cassandra, MongoDB, DynamoDB, and Cosmos DB and their integration with Hadoop clusters.
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Experience migrating data between RDBMS and unstructured sources and HDFS using Sqoop.
- Set up full CI/CD pipelines so that each developer commit goes through the standard software lifecycle and is thoroughly tested before it reaches production.
- Experience in writing Shell scripts in Linux OS and integrating them with other solutions.
- Strong experience in Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export.
- Working experience with data ingestion tools such as Apache NiFi and with loading data into a common data lake using HiveQL.
- Extensive experience developing applications that perform data processing tasks using Teradata and SQL Server.
- Developed Kafka producers and consumers, HBase clients, and Spark and Hadoop MapReduce jobs, along with components on HDFS and Hive.
- Experienced in setting up and managing Kafka clusters to build pipelines.
- Strong understanding of data modeling and the ETL process in data warehouse environments, including star and snowflake schemas.
- Involved in setting up a Jenkins master and multiple slaves for the entire team as a CI tool as part of the continuous development and deployment process.
- Experience on various projects like Data Lake, Application migrations, Cloud migrations, and Automation projects for various clients.
- Experienced with cloud-based technologies such as AWS, including Elastic MapReduce (EMR), EC2, S3, cloud monitoring, and the various databases AWS provides, to manage applications end to end.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience with Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, and Cosmos DB).
- Worked with different ingestion services for batch and real-time data handling using Spark Streaming, Confluent Kafka, Storm, Flume, and Sqoop.
- Experience in documenting Informatica mapping specifications and tuning mappings to increase performance; proficient in creating and scheduling workflows; expertise in automating ETL processes with scheduling tools such as Autosys and Tidal.
- Extensive experience in Tableau Desktop reporting features like Measures, Dimensions, Folder, Hierarchies, Extract, Filters, Table Calculations, calculated fields, Sets, Groups, Parameters, Forecasting, Blending and Trend Lines.
- Extensive knowledge of version control tools (Git, GitHub, GitLab) and incident/defect tracking tools (ServiceNow).
- Scheduled jobs using the Airflow scheduler.
- Team player, quick learner, and self-starter with effective communication, motivation, and organizational skills, combined with attention to detail and business process improvement.
- Proactive, positive-minded, quick learner, and highly passionate about work.
- Excellent analytical and communication skills and ability to work independently with minimal supervision and perform as part of a team.
- Enthusiastic about learning new concepts in emerging technologies and applying them suitably.
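As a hedged illustration of the Hive partitioning and bucketing mentioned above, a minimal PySpark sketch; the database, table, and column names are hypothetical and not taken from any project listed here:

```python
from pyspark.sql import SparkSession

# Hypothetical database, table, and column names; a sketch only.
spark = (SparkSession.builder
         .appName("hive-partition-bucket-example")
         .enableHiveSupport()
         .getOrCreate())

# Create a Hive table partitioned by order_date and bucketed on customer_id.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Partition pruning: filtering on the partition column keeps the scan small.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales_db.orders
    WHERE order_date = '2021-01-15'
    GROUP BY customer_id
""").show()
```

Filtering on the partition column lets the engine prune partitions instead of scanning the whole table, while bucketing on the grouping/join key hashes rows into a fixed number of buckets for more even distribution.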
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Sqoop, Kafka, Flume, Cassandra, Impala, Oozie, Zookeeper, EMR, Apache Spark
Cloud Technologies: AWS (Amazon Web Services), MS Azure.
IDEs: IntelliJ, Eclipse, Spyder, Jupyter, Visual Studio, NetBeans, MySQL Workbench, SQL Developer, Tableau
Operating Systems: Windows, Linux, Unix
Programming languages: Python, Scala, Linux shell scripts, PL/SQL, Java, PySpark, Pig, Hive
Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, HBASE, MongoDB, Cassandra, Teradata, MS Access
Java & J2EE Technologies: Core Java, Servlets, JSP, JDBC, Java Beans
Business Tools: Tableau, Power BI, Crystal Reports, Dashboard Design
PROFESSIONAL EXPERIENCE
Confidential - Dania beach, FL
Data Engineer
Responsibilities:
- The project is mainly focused on retail data, migrating from on-premises SQL databases and MongoDB to the AWS cloud.
- Built a pipeline (True Source) to handle the huge volume of data, mainly retail data, that comes from internal and external vendors and is loaded directly into the cloud.
- Leveraged AWS cloud services such as EC2, Auto Scaling, and VPC to build secure, highly scalable, and flexible systems that handled expected and unexpected load bursts.
- Worked on Databricks to transform Datasets and RDDs into registered table data with the help of PySpark.
- Worked with the Spark SQL API for narrow and wide transformations and aggregations in the Databricks environment.
- Experience mounting data sources such as Amazon S3 and Azure Blob Storage on the Databricks platform.
- Set up the Databricks Enterprise Platform environment and created a cross-account role in AWS for Databricks to provision Spark clusters.
- Designed and architected the various layers of the data lake.
- Consumed XML messages using Kafka and processed the XML files using Spark Streaming to capture UI updates.
- Reviewed existing Hortonworks Data Platform (HDP) and Hortonworks DataFlow (HDF) clusters in the AWS cloud and performed NiFi performance tuning to improve performance and stability.
- Extensively worked on Jenkins to implement continuous integration (CI) and continuous deployment (CD) processes.
- Developed custom Jenkins jobs/pipelines containing Bash shell scripts that use the AWS CLI to automate infrastructure provisioning.
- Worked with structured/semi-structured data ingestion and processing on AWS using S3 and Python, and migrated on-premises big data workloads to AWS.
- Wrote Python scripts to parse XML documents and load the data into the database.
- Evaluated, extracted, and transformed data for analytical purposes within the Spark Streaming environment.
- Developed Spark applications using Python (PySpark) to transform data according to business rules, with an in-depth understanding of Spark architecture including the Spark Core, Spark SQL, and DataFrame APIs.
- Used Spark SQL extensively to load data into the Spark cluster and wrote queries to fetch data from these tables.
- Created DataFrames with PySpark SQL, loaded them with data, and wrote PySpark SQL queries (see the sketch after this list).
- Generated SQL scripts in Oracle and SQL Server to pull data into Snowflake.
- Worked extensively with data migration, data cleansing, and data profiling.
- Experienced in performance tuning of Spark applications, setting the correct level of parallelism and tuning memory.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
- Experienced in handling large datasets using partitions and Spark in-memory capabilities, utilizing broadcast variables in Spark, and performing effective and efficient joins.
- Extensive expertise using the core Spark APIs and processing data on an AWS EMR cluster.
- Developed Python scripts to manage AWS resources from API calls and worked with AWS CLI.
- Responsible for launching Amazon EC2 cloud instances using AWS services and configuring launched instances with respect to specific application and regions.
- Experience working with the EC2 Container Service plugin in Jenkins, which automates the Jenkins master-slave configuration by creating temporary slaves.
- Experience deploying EMR, S3 buckets, m3.xlarge and c3.4xlarge instances, IAM roles, and CloudWatch logs.
- Developed Python scripts to transfer files between cross-region S3 buckets.
- Good familiarity with AWS services such as DynamoDB, Redshift, Simple Storage Service (S3), and Amazon Elasticsearch Service.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL.
- Worked on complex SnowSQL and Python queries in Snowflake.
- Developed Oozie workflows for scheduling and orchestrating the ETL cycle. Involved in writing Python scripts to automate the extraction of weblogs using Airflow DAGs.
- Developed complex data models in Snowflake to support analytics and self-service dashboards.
- Expertise in SQL queries; created user-defined aggregate functions and worked on advanced optimization techniques.
- Exported the analyzed data to the relational databases using Talend for visualization and report generation.
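A minimal PySpark sketch of the DataFrame and Spark SQL pattern referenced in the bullets above; the bucket, path, and column names are placeholders rather than actual project values:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Placeholder bucket, path, and column names; not actual project values.
spark = SparkSession.builder.appName("retail-transform").getOrCreate()

# Read raw retail data landed in S3 by the ingestion pipeline.
raw = spark.read.json("s3a://example-bucket/retail/raw/")

# Narrow transformation (derive a line total) followed by a wide aggregation per store.
sales = (raw
         .withColumn("line_total", F.col("quantity") * F.col("unit_price"))
         .groupBy("store_id")
         .agg(F.sum("line_total").alias("daily_sales")))

# Register a temporary view and query it with Spark SQL.
sales.createOrReplaceTempView("daily_sales")
spark.sql(
    "SELECT store_id, daily_sales FROM daily_sales ORDER BY daily_sales DESC LIMIT 10"
).show()
```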
Environment: Apache Spark, Python, PySpark, AWS, EC2, S3, Dynamo DB, Redshift, Elastic search, Apache Kafka, Airflow, XML, SQL, PostgreSQL, MySQL, Talend, MongoDB, PuTTY.
Confidential - Columbus, IN
Hadoop Developer
Responsibilities:
- Created pipelines in ADF using Linked Services to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse.
- Developed Spark applications using Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Migrated on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF v1/v2).
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
- Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, Azure Data Catalog, HDInsight, Azure SQL Server, Azure ML and Power BI.
- Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
- Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment; experience in DWH/BI project implementation using Azure DF.
- Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
- Used Databricks to integrate easily with the whole Microsoft stack.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Used DataFrames in Spark to work with distributed collections of data organized into named columns and to develop predictive analytics using Apache Spark.
- Designed conceptual data models by analyzing the data requirements needed to support the business processes.
- Developed Python scripts for data analysis and for extracting important metrics from streaming/log data to ensure the successful daily load of raw data to the server.
- Worked on Databricks for ETL data cleansing, integration, and transformation using Python scripts to manage data from disparate sources (see the sketch after this list).
- Worked on migrating data from traditional on-premises SQL databases to Azure SQL Data Warehouse.
- Worked on big data integration and analytics based on Hadoop, Spark, and Kafka clusters and webMethods.
- Performed data analysis on large volumes of data to identify duplications, data anomalies, missing data, etc.
- Pulled data into Power BI from various sources such as SQL Server, Excel, Oracle, Azure SQL, etc.
- Expertise in setting up high availability and recoverability of databases with SQL Server technologies, including Always On on Azure VMs.
- Worked on Talend ETL and used features such as context variables, database components (e.g., tMSSQLInput, tOracleOutput), file components, ELT components, etc.
- Deployed and tested (CI/CD) our developed code using Visual Studio Team Services (VSTS).
- Responsible for managing and supporting Continuous Integration (CI) using Jenkins.
- Created Talend jobs to copy files from one server to another and utilized Talend FTP components.
- Created multiple Hive tables and implemented partitioning, dynamic partitioning, and bucketing in Hive for efficient data access.
- Experienced in tuning HiveQL queries to minimize query response time.
- Experienced in handling different types of joins in Hive.
- Used Jira for bug tracking and Bitbucket to check in and check out code changes.
- Utilized machine learning algorithms such as linear regression, multivariate regression, PCA, K-means, and KNN for data analysis.
- Provided 24x7 on-call database support and backed up databases to test the integrity of the backups.
- Performed all necessary day-to-day Git support for different projects; responsible for maintaining the Git repositories and the access control strategies.
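A minimal sketch of the kind of Databricks cleansing and transformation step referenced above; the storage accounts, containers, and column names are placeholders, and cluster credentials for Blob/ADLS access are assumed to be configured separately:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Storage accounts, containers, and column names below are placeholders.
spark = SparkSession.builder.appName("adls-cleanse").getOrCreate()

source_path = "wasbs://raw@examplestorage.blob.core.windows.net/customers/"
target_path = "abfss://curated@examplelake.dfs.core.windows.net/customers/"

# Read the raw CSV feed, drop duplicate/missing keys, and standardize values.
raw = spark.read.option("header", "true").csv(source_path)
clean = (raw
         .dropDuplicates(["customer_id"])
         .na.drop(subset=["customer_id"])
         .withColumn("email", F.lower(F.trim(F.col("email")))))

# Persist the cleansed data as Parquet, partitioned by a (hypothetical) load_date column.
clean.write.mode("overwrite").partitionBy("load_date").parquet(target_path)
```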
Environment: Hadoop 2.6.0, Cloudera cdh5.11.2, HDFS, MapReduce, Sqoop 1.4.6, Hive 1.1.0, Spark 1.6.0, Scala 2.10.5, Flume 1.6.0, Kafka 2.0.2, Oracle 12.1.0.1, JIRA v7.4.1, Talend.
Confidential
Hadoop Developer
Responsibilities:
- Imported and exported data using Sqoop between HDFS and relational database systems.
- Worked on installing clusters, commissioning and decommissioning data nodes, name node recovery, capacity planning, and slot configuration.
- Hands-on experience installing, configuring, and using Hadoop ecosystem components such as MapReduce, HDFS, HBase, Hive, Sqoop, and Pig.
- Gained good exposure to Apache Hadoop, MapReduce programming, Pig scripting, distributed applications, and HDFS.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
- Used Spark SQL to load data and created schema RDDs on top of it that load into Hive tables, and handled structured data using Spark SQL (see the sketch after this list).
- Good knowledge of Hadoop cluster architecture and of monitoring the cluster.
- Wrote Hive queries for data analysis to meet the business requirements.
- Created Hive tables and worked on them using HiveQL.
- Participated in the development and implementation of the Cloudera Hadoop environment.
- Gained good experience with NoSQL databases such as HBase and Cassandra.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, Spark on YARN, and PySpark.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
- Performed data validation with Redshift and constructed pipelines designed to handle over 100 TB per day.
- Worked with business users on new Tableau version features and explained self-service capabilities.
- Designed and developed logical and physical data models using data warehouse methodologies.
- Created summary and detail dashboards in Tableau Desktop to identify data mismatches between source and reporting systems.
- Performed data profiling and preliminary data analysis and handled anomalies such as missing values, duplicates, and outliers, imputing or discarding irrelevant data.
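A minimal PySpark sketch of the schema-RDD-to-Hive pattern referenced above; the paths, log layout, and table names are hypothetical:

```python
from pyspark.sql import SparkSession, Row

# Paths, field layout, and table names are hypothetical.
spark = (SparkSession.builder
         .appName("logs-to-hive")
         .enableHiveSupport()
         .getOrCreate())
sc = spark.sparkContext

# Parse tab-delimited web log lines from HDFS into Row objects.
lines = sc.textFile("hdfs:///data/raw/weblogs/")
rows = (lines.map(lambda line: line.split("\t"))
             .map(lambda f: Row(user_id=f[0], page=f[1], event_time=f[2])))

# Build a DataFrame (the "schema RDD") and register it for Spark SQL.
logs_df = spark.createDataFrame(rows)
logs_df.createOrReplaceTempView("web_logs")

# Aggregate with Spark SQL and persist the result into a Hive table for HiveQL reporting.
spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
""").write.mode("overwrite").saveAsTable("analytics_db.page_hits")
```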
Environment: Hadoop, HDFS, Map Reduce, Hortonworks, Hive, Sqoop, Python, Unix, Shell Scripting, Spark SQL, Oozie, Kafka, Scala, HBase, Pig, PySpark, Cassandra, Cloudera, YARN, Tableau, Redshift, DWH.
Confidential
SQL Developer
Responsibilities:
- Generated SQL and PL/SQL scripts to install, create, and drop database objects, including tables, views, primary keys, indexes, constraints, packages, sequences, grants, and synonyms.
- Tuned SQL statements using hints for maximum efficiency and performance; created and maintained/modified PL/SQL packages; mentored others in the creation of complex SQL statements; performed data modeling; and created, maintained, and modified complex database triggers and data migration scripts.
- Constructed and implemented multiple-table links requiring complex join statements, including outer joins and self-joins.
- Created, debugged, and modified stored procedures, triggers, tables, views, and user-defined functions (see the sketch after this list).
- Developed UNIX shell scripts for job automation and daily backups.
- Wrote SQL queries and cursors using embedded SQL and PL/SQL.
- Tested and debugged the applications.
- Separated tables and indexes into different locations, reducing disk I/O contention.
- Involved in developing new procedures, functions, packages, and triggers and in updating old ones based on change requests.
- Involved in the analysis, design, flow, and archiving of data and databases, and of their relationships.
- Created documentation for the development code and the test cases involved.
- Worked extensively on exception handling to troubleshoot PL/SQL code.
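As a hedged illustration in the resume's main scripting language (Python), a minimal sketch of exercising the kind of stored procedures and views described above via the python-oracledb driver; the connection details, procedure, and view names are hypothetical:

```python
import oracledb  # python-oracledb driver; all names below are placeholders

# Hypothetical connection details and PL/SQL object names, for illustration only.
conn = oracledb.connect(user="app_user", password="app_pass", dsn="dbhost/ORCLPDB1")
cur = conn.cursor()

try:
    # Call a (hypothetical) stored procedure that archives orders older than 90 days.
    cur.callproc("order_pkg.archive_orders", [90])

    # Query a view with a bind variable; binds avoid hard parsing and SQL injection.
    cur.execute(
        "SELECT order_id, status FROM v_open_orders WHERE customer_id = :cust",
        cust=1001,
    )
    for order_id, status in cur:
        print(order_id, status)

    conn.commit()
except oracledb.DatabaseError:
    # Roll back if the procedure's own PL/SQL exception handling re-raises an ORA- error.
    conn.rollback()
    raise
finally:
    cur.close()
    conn.close()
```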
Environment: Oracle 11g, SQL*Plus, TOAD, SQL*Loader, SQL Developer, Shell Scripts, UNIX, Windows XP.