Big Data Engineer/Spark Resume
PROFESSIONAL SUMMARY:
- Around 8 years of professional IT experience in project development, implementation, deployment, and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytical solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.
- Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
- Working experience with Linux distributions such as Red Hat and CentOS.
- Experience in designing and building the Data Management Lifecycle covering Data Ingestion, Data Integration, Data Consumption, and Data Delivery, along with integration with Reporting, Analytics, and System-to-System integration.
- Designed, developed, and deployed Data Lakes, Data Marts, and Data Warehouses on the AWS cloud using S3, RDS, Redshift, Terraform, Lambda, Glue, EMR, Step Functions, CloudWatch Events, SNS, and IAM.
- Designed, developed, and deployed data warehouses on AWS Redshift, applying warehouse design best practices.
- Proficient in Big Data environments, with hands-on experience utilizing Hadoop ecosystem components for large-scale processing of structured and semi-structured data.
- Experience in using build/deploy tools such as Jenkins, Docker and OpenShift for Continuous Integration & Deployment for Microservices
- Strong experience with all phases including Requirement Analysis, Design, Coding, Testing, Support, and Documentation.
- Extensive experience in developing microservices using Spring Boot and Netflix OSS (Zuul, Eureka, Ribbon, Hystrix), following domain-driven design.
- Extensive experience with Azure cloud technologies such as Azure Data Lake Storage, Azure Data Factory, Azure SQL, Azure SQL Data Warehouse, Azure Synapse Analytics, Azure Analysis Services, Azure HDInsight, and Databricks.
- Solid knowledge of AWS services such as EMR, Redshift, S3, and EC2, including configuring servers for auto-scaling and elastic load balancing.
- Experience as an Azure Cloud Data Engineer with Microsoft Azure Cloud technologies including Azure Data Factory (ADF), Azure Data Lake Storage (ADLS), Azure Synapse Analytics (SQL Data Warehouse), Azure SQL Database, Azure Analysis Services, PolyBase, Azure Cosmos DB (NoSQL), Azure Key Vault, Azure DevOps, and Azure HDInsight, along with Big Data technologies such as Hadoop, Apache Spark, and Azure Databricks.
- Experience monitoring web services built on Hadoop and Spark, controlling applications and analyzing their operation and performance.
- Experienced in Python data manipulation for loading and extraction, as well as with Python libraries such as NumPy, Pandas, and SciPy for data analysis and numerical computations.
- Good knowledge and experience with NoSQL databases like HBase, Cassandra, and MongoDB and SQL databases like Teradata, Oracle, PostgreSQL, and SQL Server.
- Expertise in using major Hadoop ecosystem components like HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, ZooKeeper, and Hue.
- Big Data: Hadoop (MapReduce & Hive), Spark (SQL, Streaming), Azure Cosmos DB, Azure SQL Data Warehouse, Azure DMS, Azure Data Factory, AWS Redshift, Athena, Lambda, Step Functions, and SQL.
- Strong experience working in Azure Cloud, Azure DevOps, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight Big Data technologies (Hadoop and Apache Spark), and Databricks.
- Experience in the development and design of various scalable systems using Hadoop technologies in various environments and analyzing data using MapReduce, Hive, and PIG.
- Hands-on use of Spark and Scala to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate Data Frames in Scala.
- Strong knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions and Data Warehouse tools for reporting and data analysis.
- Hands-on experience in designing and implementing data engineering pipelines and analyzing data using Hadoop ecosystem tools like HDFS, Spark, Sqoop, Hive, Flume, Kafka, Impala, PySpark, Oozie, and HBase (a brief PySpark sketch follows this summary).
- Experience with different ETL tool environments like SSIS, Informatica, and reporting tool environments like SQL Server Reporting Services, and Business Objects.
- Experience in deploying applications and scripting using Unix/Linux shell scripting.
- Solid knowledge of Data Marts, Operational Data Stores, OLAP, and dimensional data modeling with Star Schema and Snowflake Schema modeling for dimension tables using Analysis Services.
- Extensive experience with various databases like Teradata, MongoDB, Cassandra DB, MySQL, Oracle, and SQL Server.
- Experience in creating Teradata SQL scripts using OLAP functions such as RANK and RANK() OVER to improve query performance when pulling data from large tables.
- Strong Experience in working with Databases like Teradata and proficiency in writing complex SQL, PL/SQL for creating tables, views, indexes, stored procedures, and functions.
- Knowledge and experience with Continuous Integration and Continuous Deployment using Docker containerization and Jenkins.
- Excellent working experience in Agile/Scrum development and Waterfall project execution methodologies.
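Illustrative sketch for the data engineering pipeline bullet above: a minimal PySpark job that reads raw files from HDFS, applies light cleansing, and writes to a partitioned Hive table. The paths, column names, and table names are hypothetical placeholders, not taken from any specific project.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical ingestion job: paths, columns, and table names are placeholders.
    spark = (SparkSession.builder
             .appName("transactions_ingest")
             .enableHiveSupport()
             .getOrCreate())

    # Read raw CSV files landed on HDFS
    raw = spark.read.option("header", "true").csv("hdfs:///data/raw/transactions/")

    # Light cleansing: cast the amount and standardize the load date
    clean = (raw
             .withColumn("amount", F.col("amount").cast("double"))
             .withColumn("load_dt", F.to_date(F.col("load_dt"), "yyyy-MM-dd")))

    # Persist to a partitioned Hive table for downstream analysis and reporting
    (clean.write
          .mode("overwrite")
          .partitionBy("load_dt")
          .saveAsTable("analytics.transactions_staged"))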
TECHNICAL SKILLS:
BigData/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, Airflow, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, ZooKeeper, Hue, Ambari Server
Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/R Studio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB, Azure Cosmos DB
Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML
Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans
Public Cloud: EC2, IAM, S3, Autoscaling, CloudWatch, Route53, EMR, RedShift
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos
Databases: Microsoft SQL Server 2008/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza, GraphDB
Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris
PROFESSIONAL EXPERIENCE:
Confidential
Big Data Engineer/Spark
Responsibilities:
- Evaluating client needs and translating their business requirement to functional specifications thereby onboarding them onto the Hadoop ecosystem.
- Worked with business/user groups for gathering the requirements and working on the creation and development of pipelines.
- Migrated applications from Cassandra DB to Azure Data Lake Storage Gen 1 using Azure Data Factory; created tables and loaded and analyzed data in the Azure cloud.
- Created Azure Data Factory instances, managed Data Factory policies, and utilized Blob Storage for storage and backup on Azure.
- Developed processes to ingest data into the Azure cloud from web services and load it into Azure SQL DB.
- Developed distributed Spark applications in Python (PySpark) to load high-volume files with varying schemas into DataFrames, process them, and reload the results into Azure SQL DB tables (a sketch follows this section).
- Designed and developed pipelines using Databricks, automated them for ETL processes, and maintained the resulting workloads.
- Worked on creating ETL packages using SSIS to extract data from various data sources like Access database, Excel spreadsheet, and flat files, and maintain the data using SQL Server.
- Worked with ETL operations in Azure Databricks by connecting to different relational databases using Kafka and used Informatica for creating, executing, and monitoring sessions and workflows.
- Worked on automating data ingestion into the Lakehouse and transformed the data, used Apache Spark for leveraging the data, and stored the data in Delta Lake.
- Ensured data quality and integrity of the data using Azure SQL Database and automated ETL deployment and operationalization.
- Used Azure Data Factory, SQL API, and MongoDB API to integrate data from MongoDB, MS SQL, and cloud storage (Blob, Azure SQL DB, Cosmos DB).
- Developed JSON scripts for deploying pipelines in Azure Data Factory that process data using the Cosmos activity.
- Used Databricks, Scala, and Spark for creating the data workflows and capturing the data from Delta tables in Delta Lakes.
- Performed Streaming of pipelines using Azure Event Hubs and Stream Analytics to analyze the data from the data-driven workflows.
- Worked with Delta Lake to consistently unify streaming and batch processing and to support ACID transactions using Apache Spark.
- Worked with Azure Blob Storage and developed a framework for handling high-volume data and system files.
- Implemented a low-latency distributed stream processing platform with seamless integration to data and analytics services inside and outside Azure to build a complete big data pipeline.
- Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.
- Designed and developed user-defined functions, stored procedures, and triggers for Cosmos DB.
- Worked with PowerShell scripting for maintaining and configuring the data. Automated and validated the data using Apache Airflow.
- Worked on optimizing Hive queries using best practices and appropriate parameters, leveraging Hadoop, YARN, Python, and PySpark.
- Integrated Azure Active Directory authentication into every Cosmos DB request and demoed the feature to stakeholders.
- Used Sqoop to extract the data from Teradata into HDFS and export the patterns analyzed back to Teradata.
- Worked on Kafka to bring the data from data sources and keep it in HDFS systems for filtering.
- Used Accumulators and Broadcast variables to tune the Spark applications and to monitor the created analytics and jobs.
- Tracked Hadoop cluster job performance, performed capacity planning, and tuned Hadoop for high availability and cluster recovery.
- Worked with Tableau for generating reports and created Tableau dashboards, pie charts, and heat maps according to the business requirements.
- Worked with all phases of Software Development Life Cycle and used agile methodology for development.
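Illustrative sketch for the PySpark-to-Azure SQL DB bullet above: loading files with an explicit schema into a DataFrame, applying basic processing, and reloading the results into an Azure SQL DB table over JDBC. The storage path, server, database, table, and credentials are placeholders; in practice, secrets would come from Azure Key Vault or a Databricks secret scope.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("file_to_azure_sql").getOrCreate()

    # Explicit schema so files with drifting columns fail fast instead of silently
    schema = StructType([
        StructField("order_id", StringType(), False),
        StructField("amount", DoubleType(), True),
        StructField("event_ts", TimestampType(), True),
    ])

    # Placeholder ADLS path
    df = (spark.read
          .schema(schema)
          .option("header", "true")
          .csv("abfss://landing@mystorageacct.dfs.core.windows.net/orders/"))

    # Basic processing before reloading
    processed = df.dropDuplicates(["order_id"]).na.fill({"amount": 0.0})

    # Write to Azure SQL DB over JDBC (server, database, and credentials are placeholders)
    jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"
    (processed.write
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.orders_staged")
        .option("user", "<sql-user>")
        .option("password", "<from-key-vault>")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .mode("append")
        .save())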
Confidential, Charlotte, CT
Hadoop Developer
Responsibilities:
- Involved in the complete Big Data flow of the application, from upstream data ingestion into HDFS to processing and analyzing the data in HDFS.
- Experience integrating Apache Kafka and creating Kafka pipelines for real-time processing.
- Knowledge of unifying data platforms using Kafka producers/consumers and implementing pre-processing with Storm topologies.
- Worked on diverse enterprise applications at Confidential as a Software Developer, with a good understanding of the Hadoop framework and various data analysis tools.
- Reviewed and modified CI/CD principles iteratively.
- Worked within a DevOps team to communicate effectively, improve visibility across the CI/CD pipeline, and pursue continuous improvement.
- Extensively worked on Spark using Scala on the cluster for computational analytics, installed it on top of Hadoop, and performed advanced analytics using Spark with Hive and SQL/Oracle.
- Excellent Programming skills at a higher level of abstraction using Scala, Java and Python.
- Experience in using DStreams, accumulators, broadcast variables, and RDD caching for Spark Streaming.
- Hands-on experience in developing Spark applications using Spark tools such as RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.
- Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka and Flume.
- Worked on reading multiple data formats on HDFS using Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Good experience in creating and designing data ingestion pipelines using technologies such as Apache Storm and Kafka.
- Experienced in working with in-memory processing frameworks such as Spark, including transformations, Spark SQL, MLlib, and Spark Streaming.
- Expertise in creating custom SerDes in Hive.
- Good working experience on using Sqoop to import data into HDFS from RDBMS and vice-versa.
- Experienced in implementing POCs using the Spark SQL and MLlib libraries.
- Improved the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Pair RDDs, and YARN.
- Hands on experience in handling Hive tables using Spark SQL.
- Efficient in writing MapReduce Programs and using Apache Hadoop API for analyzing the structured and unstructured data.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries (see the sketch after this section).
- Extended Hive and Pig core functionality by writing custom UDFs.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
- Hands on experience in configuring and working with Flume to load the data from multiple sources directly into HDFS.
- Good working knowledge of NoSQL databases such as HBase, MongoDB, and Cassandra.
- Used HBase in conjunction with Pig/Hive as needed for real-time, low-latency queries.
- Knowledge of job workflow scheduling and monitoring tools like Oozie (Hive, Pig) and ZooKeeper (HBase).
- Integrated Apache Storm with Kafka to perform web analytics; uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
- Developed various shell scripts and python scripts to address various production issues.
- Developed and designed automation framework using Python and Shell scripting
- Developed Java APIs for retrieval and analysis on NoSQL databases such as HBase and Cassandra.
- Good Knowledge of data compression formats like Snappy, Avro.
- Developed automated workflows for monitoring the landing zone for the files and ingestion into HDFS in Bedrock Tool and Talend.
- Created Talend jobs for data comparison between tables across different databases to identify and report discrepancies to the respective teams.
- Delivered zero defect code for three large projects which involved changes to both front end (Core Java, Presentation services) and back-end (Oracle).
- Experience with all stages of the SDLC and Agile Development model right from the requirement gathering to Deployment and production support.
- Involved in daily SCRUM meetings to discuss the development/progress and was active in making scrum meetings more productive.
- Also have experience in understanding of existing systems, maintenance and production support, on technologies such as Java, J2EE and various databases (Oracle, SQL Server).
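Illustrative sketch for the Hive partitioning and bucketing bullet above, written in PySpark rather than Scala: it writes a partitioned, bucketed Hive table and runs a HiveQL query that benefits from partition pruning. Table names, column names, and bucket counts are hypothetical.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive_partition_bucket_demo")
             .enableHiveSupport()
             .getOrCreate())

    # Source DataFrame (table name is a placeholder)
    events = spark.table("staging.web_events")

    # Distribute data: partition by date for pruning, bucket by user_id for join efficiency
    (events.write
           .mode("overwrite")
           .format("orc")
           .partitionBy("event_date")
           .bucketBy(32, "user_id")
           .sortBy("user_id")
           .saveAsTable("warehouse.web_events"))

    # Optimized HiveQL: the partition filter lets the engine prune unneeded partitions
    daily = spark.sql("""
        SELECT user_id, COUNT(*) AS page_views
        FROM warehouse.web_events
        WHERE event_date = '2020-01-15'
        GROUP BY user_id
    """)
    daily.show()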
Confidential
Data Engineer
Responsibilities:
- Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization.
- Developed a PySpark data ingestion framework to ingest source claims data into Hive tables, performing data cleansing, aggregations, and de-dup logic to identify the updated and latest records (an illustrative sketch follows this section).
- Involved in creating End-to-End data pipeline within distributed environment using the Big data tools, Spark framework and Tableau for data visualization.
- Worked on developing CloudFormation templates (CFTs) for migrating infrastructure from lower to higher environments.
- Leveraged Spark features such as in-memory processing, distributed cache, broadcast variables, accumulators, and map-side joins to implement data preprocessing pipelines with minimal latency.
- Experience in creating Python topology scripts to generate CloudFormation templates for creating EMR clusters in AWS.
- Experience in using the AWS services Athena, Redshift and Glue ETL jobs.
- Integrated Big Data Spark jobs with EMR and Glue to create ETL jobs for around 450 GB of data daily.
- Created the S3 bucket structure and Data Lake layout for optimal use of Glue crawlers and S3 buckets.
- Used the Glue Data Catalog to obtain and validate data schemas, and Lake Formation for data governance.
- Involved in loading data from AWS S3 to Snowflake and processed data for further analysis.
- Developed Analytical dashboards in Snowflake and shared data to downstream.
- Developed Spark Applications by using Scala and Python and Implemented Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Used Spark and PySpark for streaming and batch applications on many ETL jobs to and from data sources.
- Developed new API Gateway for streaming to Kinesis and ingestion of event streaming data.
- Worked on building data-centric queries for cost optimization in Snowflake.
- Good knowledge on AWS Services like EC2, EMR, S3, Service Catalog, and Cloud Watch.
- Experience in using Spark SQL to handle structured data from Hive in AWS EMR Platform.
- Explored Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs.
- Experienced in handling large datasets using partitioning, Spark's in-memory capabilities, broadcasts, effective and efficient joins, and transformations during the ingestion process itself.
- Wrote unit test cases for Spark code as part of the CI/CD process.
- Good knowledge of configuration management tools like Bitbucket/GitHub and Bamboo (CI/CD).
- Developed Spark jobs on Databricks to perform data cleansing, data validation, and standardization, and then applied transformations per the use case.
- Experience in analyzing data from Azure data stores using Databricks to derive insights with Spark cluster capabilities.
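Illustrative sketch for the PySpark ingestion/de-dup bullet above: keeping only the latest record per claim using a window function before writing to a curated Hive table. Table and column names are hypothetical placeholders, not from any specific engagement.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("claims_dedup")
             .enableHiveSupport()
             .getOrCreate())

    # Raw claims landed in a staging Hive table (name is a placeholder)
    raw = spark.table("staging.claims_raw")

    # Keep only the latest record per claim based on the update timestamp
    w = Window.partitionBy("claim_id").orderBy(F.col("last_updated_ts").desc())

    latest = (raw
              .withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") == 1)
              .drop("rn"))

    # Write the de-duplicated snapshot to the curated Hive table
    (latest.write
           .mode("overwrite")
           .saveAsTable("curated.claims"))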
Confidential
Java/J2EE Developer
Responsibilities:
- Followed Agile methodology, attended meetings to track and optimize progress, and developed sequence diagrams depicting method interactions using MS Visio.
- Conducted analysis of organizational needs and goals for the development and implementation of application systems by involving with business personnel.
- Developed application using Spring MVC, JSP, JSTL and AJAX on the presentation layer, the business layer is built using Spring and the persistent layer uses Hibernate.
- Data Operations were performed using Spring ORM wiring with Hibernate and Implemented Hibernate Template and criteria API for Querying database.
- Developed various J2EE components like SAX, XSLT, JAXP, JNDI, LDAP, JMS, MQ Series.
- Used AJAX in suggestive search and to display dialog boxes with JSF and DOJO for front-end applications.
- Implemented Spring Framework components, including controller classes and the Spring bean configuration file (dispatcher-servlet.xml).
- Developed Web Services using XML messages that use SOAP. Developed Web Services for Payment Transaction and Payment Release.
- Used WSDL and SOAP protocol for Web Services implementation.
- Worked in Struts framework based on MVC Architecture.
- Wrote stored procedures, SQL scripts in Oracle for Data Accessing and manipulation.
- Compiled and built the application using ANT scripts and deployed the application.
- Configured and created applications log files using Log4j.
- Actively involved in code reviews and bug fixing.
- Participated in the status meetings and status updates to the management team.